Abstract

This paper presents an approach that brings together game theory with grammatical inference and
discrete abstractions in order to synthesize control strategies for hybrid dynamical systems performing
tasks in partially unknown but rule-governed adversarial environments. The combined formulation
guarantees that a system specification is met if (a) the true model of the environment is in the class of
models inferable from a positive presentation, (b) a characteristic sample is observed, and (c) the task
specification is satisfiable given the capabilities of the system (agent) and the environment.
I. Introduction



A. Overview 

This paper demonstrates how a particular method of machine learning can be incorporated into
hybrid system planning and control, to enable systems to accomplish complex tasks in unknown
and adversarial environments. This is achieved by bringing together formal abstraction methods
for hybrid systems, grammatical inference and (infinite) game theory.

Many automation systems, particularly commercially available ones, come with control user
interfaces that involve continuous low-to-mid level controllers, which are either specialized for
the particular application, or are designed with certain ease-of-use, safety, or performance
specifications in mind. This paper proposes a control synthesis method that works with, rather
than in lieu of, existing control loops. The focus here is on how to abstract the given low-level control
loops [1] and the environment they operate in [2], and combine simple closed loop behaviors in 
an orchestrated temporal sequence. The goal is to do so in a way that guarantees the satisfaction 
of a task specification and is provably implementable at the level of these low-level control and 
actuation loops. 

As a field of study, grammatical inference is primarily concerned with developing algorithms 
that are guaranteed to learn how to identify any member of a collection of formal objects (such 
as languages or graphs) from a presentation of examples and/or non-examples of that object, 
provided certain conditions are met [3]. The conditions are typical in learning research: the data 
presentation must be adequate, the objects in the class must be reachable by the generalizations 
the algorithms make, and there is often a trade-off between the two. 

Here, grammatical inference is integrated into planning and control synthesis using game 
theory. Game theory is a natural framework for reactive planning of a system in a dynamic 
environment [4]. A task specification becomes a winning condition, and the controller takes the 
form of a strategy that indicates which actions the system (player 1) needs to take so that the 
specification is met regardless of what happens in its environment (player 2) [5], [6]. It turns 
out that interesting motion planning and control problems can be formulated at a discrete level 
as a variant of reachability games [7], in which a memoryless winning strategy can be computed
for one of the players, given the initial setting of the game. 

In the formulation we consider, the rules of the game are assumed to be initially unknown 
to the system; the latter is supposed to operate in a potentially adversarial environment with 
unknown dynamics. The application of grammatical inference algorithms to the observations 
collected by the system during the course of the game enables it to construct and incrementally 
update a model of this environment. Once the system has learned the true nature of the game, 
and if it is possible for it to win in this game, then it will indeed find a winning strategy, no
matter how effectively the adversarial environment might try to prevent it from doing so. In
other words, the proposed framework guarantees the satisfaction of the task specification in the 
face of uncertainty, provided certain conditions are met. If those conditions are not met, then 
the system is no worse off than when not using grammatical inference algorithms. 

B. Related work 

So far, symbolic planning and control methods address problems where the environment is 
either static and presumably known, or satisfies given assumptions [8]-[10]. 

In cases where the environment is static and known, we see applications of formal methods 
like model checking [9], [11]. In other variants of this formulation, reactive control synthesis is
used to tackle cases where system behavior needs to be re-planned based on information obtained
from the environment in real time [8]. In [10] a control strategy is synthesized for maximizing
the probability of completing the goal given actuation errors and noisy measurements from the
environment. Methods for ensuring that the system exhibits correct behavior even when there is
a mismatch between the actual environment and its assumed model are proposed in [12].

Linear Temporal Logic (LTL) plays an important role in existing approaches to symbolic
planning and control. It is used to capture safety, liveness and reachability specifications
[13]. A formulation of LTL games on graphs is used in [14] to synthesize control strategies
for non-deterministic transition systems. Assuming an uncertain system model, [12] combines
temporal logic control synthesis with receding horizon control concepts. Centralized control
designs for groups of robots tasked with satisfying an LTL-formula specification are found in
[15], under the assumption that the environment in which the robots operate adheres to certain
conditions. These methods are extended in [16] to enable the plan to be revised during execution.

Outside of the hybrid systems area, adjusting unknown system parameters has traditionally
been done by employing adaptive control or machine learning methods. Established adaptive
control techniques operate in a purely continuous state regime, and most impose stringent
conditions (e.g., linearity) on the system dynamics; for these reasons they are not covered in
this limited-scope review, and the interested reader is referred to [17], [18]. On the
other hand, machine learning is arguably a broader field. A significant portion of existing work
is based on reinforcement learning, which has been applied to a variety of problems such as
multi-agent control [19], humanoid robots [20], varying-terrain wheeled robot navigation [21],
and unmanned aerial vehicle control [22]. The use of grammatical inference as a sub-field of
machine learning in the context of robotics and control is not entirely new; an example is the
application of a grammatical inference machine (GIM) in robotic self-assembly [23].

In the aforementioned formulations there is no consideration for dynamic adversarial
environments. A notable exception is the work of [24], which is developed in parallel to, and in part
independently from, the one in this paper. The idea of combining learning with hybrid system
control synthesis is a natural common theme since both methods originate from the same joint
sponsored research project. Yet, the two approaches are distinct in how they highlight different
aspects of the problem of synthesis in the presence of dynamic uncertainty. In [24], the learning
module generates a model for a stochastic environment in the form of a Markov Decision Process 
and control synthesis is performed using model checking tools. In this paper, the environment is 
deterministic, but intelligently adversarial and with full knowledge of the system's capabilities. 
In addition, the control synthesis here utilizes tools from the theory of games on infinite words. 

C. Approach and contributions 

This paper introduces a symbolic control synthesis method based on the architecture of 
Fig. 1(a), where a GIM is incorporated into planning and control algorithms of a hybrid system 
(a robot, in Fig. 1(a)) to identify the dynamics of an evolving but rule-governed environment. The 
system — its boundaries outlined with a thick line — interacts with its environment through sensors 
and actuators. Both the system as well as its environment are dynamical systems (shown as ovals), 
assumed to admit discrete abstractions in the form of transition systems (dashed rectangles). The 
system is required to meet a certain specification. Given its specification ($A_s$), an abstraction
of itself ($A_1$), and its hypothesis of the dynamics of its environment ($A_2$), the system devises
a plan and implements it utilizing a finite set of low-level concrete control loops involving
sensory feedback. Using this sensory information, the system refines its discrete environment
model based on a GIM, which is guaranteed to identify the environment dynamics asymptotically.
Figure 1(b) gives a general description of the implementation of learning and symbolic planning
at the high-level of the architecture in Fig. 1(a). The hypothesis on the environment dynamics is
at the center of the system's planning algorithm. Through interactions with the environment, the
system observes the discrete evolution ($\phi(i)$) of the environment dynamics, and uses the GIM to
construct and update a hypothesized environment model $A_2^{(i)}$. Based on the environment model,
the system constructs a hypothesis (model) $\mathcal{G}^{(i)}$ capturing how the "game" between itself and the
environment is played, and uses this model to devise a winning strategy (control law) $WS^{(i)}$. As
the environment model converges asymptotically to the true dynamics $A_2$, the winning strategy
becomes increasingly effective. In the limit, the system is guaranteed to win the game.

Fig. 1: The architecture of hybrid agentic planning and control with a module for grammatical inference.
(a) An overview of the architecture. (b) Learning and planning with grammatical inference module
at the higher level.

Definitions 7, 8 and Theorem 5 establish how a game can be constructed from the system
abstractions of the (hybrid) system dynamics ($A_1$), the environmental dynamics ($A_2$), and the task
specification ($A_s$). Theorem 4 proves that the hybrid agent can determine whether a winning
strategy exists, and if it does, what it is. Grammatical inference methods yield increasingly 
accurate models of environmental dynamics (assuming adequate data presentations and reachable 
targets), and permit the system to converge to an accurate model of its environment. Discrete 
backward reachability calculations can be executed in a straightforward manner and can allow 
the determination of winning strategies (symbolic control laws), whenever the latter exist. 

The contribution of this paper is two-fold: (i) it integrates GIMs into hybrid systems for the 
purpose of identifying the discrete dynamics of the environment that evolve and possibly interact 
with the system, and (ii) it uses the theory of games on infinite words for symbolic control 
synthesis, and discrete abstractions which ensure implementation of the symbolic plans on the 
concrete hybrid system. In the paper, both elements are combined, but each element has merit
even in isolation. A hybrid system equipped with GIM is still compatible with existing symbolic
control synthesis methods (including model checking). On the other hand, the abstraction
methods we utilize here, although requiring strong properties on the continuous component
dynamics of the hybrid system, offer discrete abstract models which are weakly simulated by
the concrete systems, irrespective of whether the latter include a GIM or not.

D. Organization 

The rest of this paper is organized as follows. Section II introduces the technical background, 
the notation, and the models used. The type of hybrid systems considered and their discrete 
abstractions are presented there. In Section III, we show how the control problem can be 
formulated as a game and employ the concept of the attractor in games for control synthesis. 
Section IV describes first how a GIM can be used to identify asymptotically the dynamics of the 
system's unknown and adversarial environment, and then how this knowledge can be utilized in 
planning and control synthesis. In Section V, we establish the properties of the relation between 
the hybrid system and its discrete abstraction, which ensure that the strategy devised based on 
the discrete model is implementable on the concrete system. Section VI illustrates the whole 
approach through an example robotic application. In Section VII we discuss possible extensions 
of the proposed methodology and compare our grammatical inference to other learning methods. 

II. Technical Preliminaries 

A. Languages and Grammatical Inference 

Let $\Sigma$ denote a fixed, finite alphabet, and $\Sigma^n$, $\Sigma^{\le n}$, $\Sigma^*$, $\Sigma^\omega$ be sequences over this alphabet of
length $n$, of length less than or equal to $n$, of any finite length, and of infinite length, respectively.
The empty string is denoted $\lambda$, and the length of string $w$ is denoted $|w|$. A language $L$ is a
subset of $\Sigma^*$. A string $u$ is a prefix (suffix) of a string $w$ if and only if there exists a string
$v$ such that $w = uv$ ($w = vu$). A prefix (suffix) of length $k$ of a string $w$ is denoted $\mathrm{Pr}^{=k}(w)$
(respectively, $\mathrm{Sf}^{=k}(w)$) and a set of prefixes (suffixes) of a string $w$ of length $\le k$ is denoted
as $\mathrm{Pr}^{\le k}(w)$ (respectively, $\mathrm{Sf}^{\le k}(w)$). For $w = \sigma_1\sigma_2\cdots\sigma_n \in \Sigma^*$, the shuffle ideal of $w$ is defined
as $\mathrm{SI}(w) := \Sigma^*\sigma_1\Sigma^*\sigma_2\cdots\Sigma^*\sigma_n\Sigma^*$. A string $u$ is a factor of string $w$ iff $\exists x, y \in \Sigma^*$ such that
$w = xuy$. If in addition $|u| = k$, then $u$ is a $k$-factor of $w$. If $E$ is a set, $2^E$ denotes the set
of all subsets and $2^E_{fin}$ the set of all finite subsets of $E$. A string extension function (SEF) is a
total function, $f : \Sigma^* \to 2^E_{fin}$. The $k$-factor function $f_k : \Sigma^* \to 2^{\Sigma^{\le k}}_{fin}$ maps a word to the set of
$k$-factors within it. If $|w| < k$, $f_k(w) := \{w\}$, otherwise $f_k(w) := \{u \mid u \text{ is a } k\text{-factor of } w\}$.
This function is extended to languages as $f_k(L) := \bigcup_{w \in L} f_k(w)$.
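
As a minimal sketch of the $k$-factor function just defined, the Python fragment below computes $f_k(w)$ for a single word and its extension to a finite set of words. The function names `k_factors` and `k_factors_of_language` are illustrative choices, not notation from the paper, and strings over single-character symbols are assumed.

```python
def k_factors(w, k):
    """f_k(w): {w} if |w| < k, otherwise the set of all length-k factors of w."""
    if len(w) < k:
        return frozenset({w})
    return frozenset(w[i:i + k] for i in range(len(w) - k + 1))


def k_factors_of_language(words, k):
    """f_k extended to a (finite) collection of words: the union of f_k(w)."""
    factors = set()
    for w in words:
        factors |= k_factors(w, k)
    return frozenset(factors)


if __name__ == "__main__":
    print(sorted(k_factors("abab", 2)))                     # ['ab', 'ba']
    print(sorted(k_factors_of_language(["ab", "bcd"], 2)))  # ['ab', 'bc', 'cd']
```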

A semiautomaton (SA) is a tuple $A = \langle Q, \Sigma, T \rangle$ where $Q$ is the set of states, $\Sigma$ is the
alphabet, and the transition function is $T : Q \times \Sigma \to Q$. The elements of $\Sigma$ are referred to as
actions and are thought to initiate transitions at a given state according to $T$. If $T(q_1, \sigma) = q_2$
(also written as $q_1 \xrightarrow{\sigma} q_2$) with $\sigma \in \Sigma$, then we say that $A$ takes action $\sigma$ on $q_1$ and moves to
$q_2$. The transition function is expanded recursively in the usual way. Note that, by definition, these
SAs are deterministic in transition. For a (semi)automaton $A$, we define the set-valued function
$\Gamma : Q \to 2^\Sigma$ as $\Gamma(q) := \{\sigma \in \Sigma \mid T(q, \sigma) \text{ is defined}\}$. A finite state automaton (FSA) is a tuple
$\mathcal{A} = \langle A, I, F \rangle$ where $A = \langle Q, \Sigma, T \rangle$ is a semiautomaton and $I, F \subseteq Q$ are the initial and final
states, respectively. The language of a FSA is $L(\mathcal{A}) := \{w \mid T(I, w) \cap F \neq \emptyset\}$. For a regular
language $L$, deterministic FSAs recognizing $L$ with the fewest states are called canonical.
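
The definitions above can be sketched in a few lines of Python. The class name `SemiAutomaton`, the dictionary encoding of $T$, and the helper `accepts` are assumptions made for illustration only; the partial transition function is represented by the keys that are present in the dictionary.

```python
class SemiAutomaton:
    """Deterministic semiautomaton with a partial transition function T."""

    def __init__(self, transitions):
        # transitions: dict mapping (state, action) -> next state
        self.T = dict(transitions)

    def gamma(self, q):
        """Gamma(q): the actions for which T(q, action) is defined."""
        return {a for (p, a) in self.T if p == q}

    def run(self, q, word):
        """Recursive extension of T to words; None when T is undefined."""
        for a in word:
            if (q, a) not in self.T:
                return None
            q = self.T[(q, a)]
        return q


def accepts(sa, initial, final, word):
    """word is in L(A) iff some initial state reaches a final state on it."""
    return any(sa.run(q0, word) in final for q0 in initial)


if __name__ == "__main__":
    sa = SemiAutomaton({("q0", "a"): "q1", ("q1", "b"): "q0"})
    print(sa.gamma("q0"))                        # {'a'}
    print(accepts(sa, {"q0"}, {"q0"}, "abab"))   # True
    print(accepts(sa, {"q0"}, {"q0"}, "aba"))    # False
```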

For concreteness, let grammars of languages be constructed as the set of possible Turing
machines $\mathfrak{G}$. (Other kinds of grammars are used later, but they are translatable into Turing
machines.) The language of a particular grammar $\mathfrak{G}$ is $L(\mathfrak{G})$. A positive presentation of a
language $L$ is a total function $\phi : \mathbb{N} \to L \cup \{\#\}$ ($\#$ is a "pause"$^1$) such that for every $w \in L$,
there exists $n \in \mathbb{N}$ such that $\phi(n) = w$. With a small abuse of notation, a presentation $\phi$ can
also be understood as an infinite sequence $\phi(1)\phi(2)\cdots$ containing every element of $L$.

Let $\phi[i]$ denote the initial finite sequence $\phi(1)\phi(2)\ldots\phi(i)$. Let $\mathrm{SEQ}$ denote the set of all
finitely long initial portions of all possible presentations of all possible languages (i.e., all $\phi[i]$
for all $i \in \mathbb{N}$ and for all $L$). The content of $\phi[i]$, written content($\phi[i]$), is the set of the elements
of the sequence, less the pauses. A learner (learning algorithm, or GIM) is a program that takes
the first $i$ elements of a presentation and returns a grammar as output: $\mathfrak{Gim} : \mathrm{SEQ} \to \mathfrak{G}$. The
grammar returned by $\mathfrak{Gim}$ is the learner's hypothesis of the language. A learner $\mathfrak{Gim}$ identifies in
the limit from positive presentations of a collection of languages $\mathcal{L}$ if and only if for all $L \in \mathcal{L}$,
for all presentations $\phi$ of $L$, there exists $n \in \mathbb{N}$ such that for all $m \ge n$, $\mathfrak{Gim}(\phi[m]) = \mathfrak{G}$ and
$L(\mathfrak{G}) = L$ [25]. A characteristic sample $S$ for a language $L$ and a learner $\mathfrak{Gim}$ is a finite set of
strings belonging to $L$ such that for any $\phi[i]$ such that content($\phi[i]$) $= S$, it is the case that for
all $j \ge i$, $\mathfrak{Gim}(\phi[j]) = \mathfrak{G}$ and $L(\mathfrak{G}) = L$.

$^1$ Pause $\#$ can be understood as "non data."
Definition 1 (String extension grammar and languages [26]): Let $f$ be a SEF, and $E$ be a
set. A string extension grammar $\mathfrak{G}$ is a finite subset of $E$. The string extension language of
grammar $\mathfrak{G}$ is $L_f(\mathfrak{G}) = \{w \in \Sigma^* : f(w) \subseteq \mathfrak{G}\}$. The class of string extension languages is

$$\mathcal{L}_f := \{L_f(\mathfrak{G}) : \mathfrak{G} \in 2^E_{fin}\}.$$

Definition 2 (String Extension Learner [26]): Let $f$ be a SEF. For all positive presentations
$\phi$, define $\mathfrak{Gim}_f$ as: $\mathfrak{Gim}_f(\phi[i]) = \emptyset$ if $i = 0$, and

$$\mathfrak{Gim}_f(\phi[i]) = \begin{cases} \mathfrak{Gim}_f(\phi[i-1]) & \text{if } \phi(i) = \#, \\ \mathfrak{Gim}_f(\phi[i-1]) \cup f(\phi(i)) & \text{otherwise.} \end{cases}$$

According to [25], the class of regular languages is not identifiable in the limit from positive
presentation, but string extension languages (which are subclasses of regular languages) are.

Theorem 1 ([26]): Learner $\mathfrak{Gim}_f$ identifies $\mathcal{L}_f$ in the limit.
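
To make the behavior of a string extension learner concrete, the following Python sketch follows the usual incremental scheme for such learners: start from the empty grammar, skip pauses, and otherwise add $f$ of the newly observed string to the current hypothesis. The names `string_extension_learner` and `in_language`, the use of `None` for the pause $\#$, and the choice of the 2-factor function as the SEF are illustrative assumptions.

```python
def k_factors(w, k):
    """k-factor function f_k, used here as the string extension function."""
    if len(w) < k:
        return frozenset({w})
    return frozenset(w[i:i + k] for i in range(len(w) - k + 1))


def string_extension_learner(presentation, f):
    """Yield the hypothesis grammar after each element of a presentation.

    None plays the role of the pause symbol '#'.
    """
    grammar = frozenset()            # hypothesis before any data is seen
    for item in presentation:
        if item is not None:         # pauses leave the hypothesis unchanged
            grammar = grammar | f(item)
        yield grammar


def in_language(w, grammar, f):
    """w belongs to L_f(grammar) iff f(w) is a subset of the grammar."""
    return f(w) <= grammar


if __name__ == "__main__":
    f2 = lambda w: k_factors(w, 2)
    hypotheses = list(string_extension_learner(["ab", None, "abab"], f2))
    for i, g in enumerate(hypotheses, start=1):
        print(i, sorted(g))
    print(in_language("bababa", hypotheses[-1], f2))   # True
    print(in_language("aab", hypotheses[-1], f2))      # False
```

The hypothesis only ever grows, which is why a finite characteristic sample suffices for it to stabilize.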
Many attractive properties of string extension learners are established in [27]. A language $L$
is Strictly $k$-Local (SL$_k$) [28], [29] iff there exists a finite set $S \subseteq f_k(\rtimes\Sigma^*\ltimes)$, such that $L =
\{w \in \Sigma^* : f_k(\rtimes w \ltimes) \subseteq S\}$, where $\rtimes$, $\ltimes$ are the symbols indicating the beginning and end of
a string, respectively. Obviously, Strictly $k$-Local languages are string extension languages. The
following theorem follows immediately.

Theorem 2 ([30]): For every $k$, Strictly $k$-Local languages are identifiable in the limit from
positive presentations.

Theorem 3 (SL-Hierarchy [31]): SL$_1$ $\subset$ SL$_2$ $\subset \ldots \subset$ SL$_i$ $\subset$ SL$_{i+1}$ $\subset \ldots \subset$ SL.
The implication of Theorem 3 is that any Strictly $k$-Local language can be described using a
SL$_j$ grammar, where $j \ge k$. Section IV illustrates this argument with the help of an example.
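
A small Python sketch may help illustrate Strictly $k$-Local grammars and the remark following Theorem 3. It is an illustration under stated assumptions rather than the construction used later in the paper: the characters `>` and `<` stand in for the boundary markers $\rtimes$ and $\ltimes$, and `learn_sl` and `sl_accepts` are hypothetical helper names.

```python
LEFT, RIGHT = ">", "<"   # stand-ins for the begin/end-of-string markers


def k_factors(w, k):
    if len(w) < k:
        return frozenset({w})
    return frozenset(w[i:i + k] for i in range(len(w) - k + 1))


def learn_sl(sample, k):
    """Collect the k-factors of every marked sample string into a grammar."""
    grammar = set()
    for w in sample:
        grammar |= k_factors(LEFT + w + RIGHT, k)
    return frozenset(grammar)


def sl_accepts(w, grammar, k):
    """w is accepted iff all k-factors of the marked string are permitted."""
    return k_factors(LEFT + w + RIGHT, k) <= grammar


if __name__ == "__main__":
    sample = ["ab", "abab"]
    g2 = learn_sl(sample, 2)     # an SL_2 grammar
    g3 = learn_sl(sample, 3)     # an SL_3 grammar built from the same sample
    for word in ["ababab", "ba", "aba"]:
        print(word, sl_accepts(word, g2, 2), sl_accepts(word, g3, 3))
```

On this sample, both grammars accept `ababab` and reject `ba` and `aba`, which is consistent with an SL$_2$ language also admitting an SL$_3$ description.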

B. Hybrid Systems and Abstractions 

A hybrid system H is defined as a tuple of objects (for a precise definition, see [32]) that 
includes the domains of continuous and discrete variables, the subsets of initial states in those 
domains, the description of the family of continuous dynamics parametrized by the discrete 
states, and rules for resetting continuous and discrete states and switching between the members 
of the family of continuous dynamics. 

In this paper, we restrict our attention to a specific class of hybrid systems where the continuous
dynamics have specific (set) attractors [1]. The shape and location of these attractors are assumed
dependent on a finite set of continuous parameters that are selected as part of closing the outer
control loop. Judicious selection of the parameters activates a specific sequence of continuous
and discrete transitions, which in turn steers the hybrid system $H$ from a given initial state to a
final desired state. This class admits purely discrete (predicate-based) abstractions. We call these
particular types of hybrid automata hybrid agents, to distinguish them from general cases.

Definition 3 (Hybrid Agent): The hybrid agent is a tuple:

$H_a = \langle \mathcal{Z}, \Sigma_a, \iota, \mathcal{P}, \pi_i, AP, f_\sigma, \mathrm{Pre}, \mathrm{Post}, s, T_a \rangle$.

• $\mathcal{Z} = \mathcal{X} \times \mathcal{L}$ is a set of composite (continuous and Boolean) states, where $\mathcal{X} \subset \mathbb{R}^n$ is a
compact set, and $\mathcal{L} \subseteq \{0, 1\}^r$ where $r$ is the number of Boolean states.

• $\Sigma_a$ is a finite set of discrete states (control modes).

• $\iota : \Sigma_a \to \{1, \ldots, k\}$ is a function, indexing the set of symbols in $\Sigma_a$.

• $\mathcal{P} \subseteq \mathbb{R}^m$ is the set of (column) vectors of continuous parameters.

• $\pi_i : \mathbb{R}^m \to \mathbb{R}^{m_i}$, for $i = 1, \ldots, k$ is a finite set of canonical projections, such that
$p = (\pi_1(p)^\top, \ldots, \pi_k(p)^\top)^\top$.

• $AP$ is a set of (logical) atomic propositions over $\mathcal{Z} \times \mathcal{P}$, denoted $\{\alpha(z, p)\}$. A set
of well-formed formulae WFF [33] is defined inductively as follows: (a) if $\alpha \in AP$, then
$\alpha \in$ WFF; (b) if $\alpha_1$ and $\alpha_2$ are in WFF, then so are $\neg\alpha_1$ and $\alpha_1 \wedge \alpha_2$.

• $f_\sigma : \mathcal{Z} \times \mathcal{P} \to T\mathcal{X}$ is a finite set of families of vector fields parametrized by $p \in \mathcal{P}$, $\ell \in \mathcal{L}$
and $\sigma \in \Sigma_a$, with respect to which $\mathcal{X}$ is positively invariant. These vector fields have limit
sets$^2$ parametrized by $p$ and $\sigma$, denoted $L^+(p, \sigma)$.

• Pre: $\Sigma_a \to$ WFF maps a discrete state to a formula that needs to be satisfied whenever $H_a$
switches to discrete state $\sigma$ from any other state. When composite state $z$ and parameter
vector $p$ satisfy this formula we write $(z, p) \models \mathrm{Pre}(\sigma)$.

• Post: $\Sigma_a \to$ WFF maps a discrete location to a formula that is satisfied when the trajectories
of $f_\sigma$ reach an $\epsilon$-neighborhood$^3$ of their limit set. When composite state $z$ and parameter
vector $p$ satisfy this formula we write $(z, p) \models \mathrm{Post}(\sigma)$.

• $s : \mathcal{Z} \times \mathcal{P} \to 2^{\mathcal{P}}$ is the reset map for the parameters. It assigns to each pair of composite
state and parameter a subset of $\mathcal{P}$ which contains all values to which the current value of
$p \in \mathcal{P}$ can be reassigned.

• $T_a : \mathcal{Z} \times \mathcal{P} \times \Sigma_a \to \mathcal{Z} \times \mathcal{P} \times \Sigma_a$ is the discrete state transition map, according to which
$(z, p, \sigma) \to (z, p', \sigma')$ iff $(z, p) \models \mathrm{Post}(\sigma)$ and $(z, p') \models \mathrm{Pre}(\sigma')$ with $p' \in s(z, p)$.

$^2$ The compactness and invariance of $\mathcal{X}$ guarantee the existence of attractive, compact and invariant limit sets [34].
$^3$ Written $L^+(p, \sigma) \oplus B_\epsilon$, where $\oplus$ denotes the Minkowski (set) sum and $B_\epsilon$ is the open ball of radius $\epsilon$.
The configuration of $H_a$ is denoted $h := [z, p, \sigma]$, and for each discrete state $\sigma$ we single out two
subsets of $\mathcal{Z} \times \mathcal{P}$: the pairs satisfying its precondition, $\{(z, p) : (z, p) \models \mathrm{Pre}(\sigma)\}$, and the pairs
satisfying its postcondition, $\{(z, p) : (z, p) \models \mathrm{Post}(\sigma)\}$. A transition from $\sigma_i$ to $\sigma_{i+1}$ (if any) is
forced and occurs at the time instance when
the trajectory of $f_{\sigma_i}(x, \ell, p)$ hits a nonempty intersection of an $\epsilon$-neighborhood of its limit set
and the region of attraction of $\sigma_{i+1}$ parametrized by $p'$ ($p'$ is not necessarily equal to $p$). After a
transition $(z, p, \sigma) \to (z, p', \sigma')$ occurs, the composite state $z$ evolves into composite state $z'$ for
which $(z', p') \models \mathrm{Post}(\sigma')$. We refer to this (non-instantaneous) change from $z$ to $z'$ as an evolution.

We will use a form of predicate abstraction to obtain a coarse, discrete representation of $H_a$.
Our abstraction map is denoted $V_M : \mathcal{Z} \times \mathcal{P} \to \{0, 1\}^{|AP|}$ and referred to as the valuation map:

Definition 4 (Valuation map): The valuation map $V_M : \mathcal{Z} \times \mathcal{P} \to \mathcal{V} \subseteq \{1, 0\}^{|AP|}$ is a function
that maps pairs of composite states and parameters, to a binary vector $v \in \mathcal{V}$ of dimension $|AP|$.
The element at position $i$ in $v$, denoted $v[i]$, is 1 or 0 if $\alpha_i \in AP$ is true or false, respectively,
for a particular pair $(z, p)$. We write $\alpha_i(z, p) = v[i]$, for $v \in \mathcal{V}$.
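
As a rough illustration of Definition 4, the valuation map can be sketched as a function that evaluates a list of predicates on a pair $(z, p)$. The two predicates below (`near_target`, `gripper_closed`), the encoding of $z$ as a pair of a scalar position and a Boolean flag, and the threshold 0.5 are all hypothetical and chosen only to make the sketch runnable.

```python
def valuation_map(atomic_props, z, p):
    """Return the binary vector v with v[i] = 1 iff the i-th proposition holds at (z, p)."""
    return tuple(1 if alpha(z, p) else 0 for alpha in atomic_props)


# Hypothetical atomic propositions over z = (x, flag) and a scalar parameter p.
def near_target(z, p):
    return abs(z[0] - p) < 0.5


def gripper_closed(z, p):
    return z[1] == 1


if __name__ == "__main__":
    AP = [near_target, gripper_closed]
    print(valuation_map(AP, (1.2, 1), 1.0))   # (1, 1)
    print(valuation_map(AP, (3.0, 0), 1.0))   # (0, 0)
```

Distinct continuous states that agree on every predicate collapse to the same abstract state, which is what keeps the induced state set finite.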

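The purely discrete model that we use as an abstraction of $H_a$, referred to as the induced
transition system, is defined in terms of the valuation map as follows.

Definition 5 (Induced transition system): A hybrid agent $H_a$ induces a semiautomaton $A(H_a) =
\langle Q, \Sigma, T \rangle$ in which (i) $Q = V_M(\mathcal{Z} \times \mathcal{P})$ is a finite set of states; (ii) $\Sigma = \Sigma_a \cup \{\tau_1, \ldots, \tau_m\}$,
$m \le |Q \times Q|$ is a finite set of labels; (iii) $T \subseteq Q \times \Sigma \times Q$ is a transition relation with
the following semantics: $q \xrightarrow{\sigma} q' \in T$ iff either (1) $\sigma \in \Sigma_a$ and $(\exists p)$ $(\forall z \in \{z \mid V_M(z, p) = q\})$
$(\forall z' \in \{z' \mid (z', p) \models \mathrm{Post}(\sigma)\})$ $[(z, p) \models \mathrm{Pre}(\sigma),\ V_M(z', p) = q']$, or (2) $\sigma \in \Sigma \setminus \Sigma_a$ and $(\exists p)$
$(\forall z \in \{z \mid V_M(z, p) = q\})$ $(\exists p' \in s(z, p), \sigma' \in \Sigma_a)$ $[V_M(z, p') = q',\ (z, p') \models \mathrm{Pre}(\sigma')]$.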

It will be shown in Section V that $H_a$ and $A(H_a)$ are linked through an equivalence relation,
namely an observable (weak) simulation relation. Broadly speaking, the sequences (strings in $\Sigma_a^*$) of
discrete states which $H_a$ visits starting from $[z, p, \sigma]$ can be matched by a word $w$ such that
$T(V_M(z, p), w)$ is defined in $A(H_a)$, and vice versa, modulo symbols in $\Sigma \setminus \Sigma_a$ that are thought
of as silent. When a SA moves from state $q$ to state $q'$ through a series of consecutive transitions
among which only one is labeled with $\sigma \in \Sigma_a$ and all others in $\Sigma \setminus \Sigma_a$, then we say that the SA
takes a composite transition from $q$ to $q'$, labeled with $\sigma$, and denoted $q \overset{\sigma}{\leadsto} q'$.

Definition 6 (Weak (observable) simulation [35]): Consider two (labeled) semiautomata over
the same input alphabet $\Sigma$, $A_1 = \langle Q_1, \Sigma, \leadsto_1 \rangle$ and $A_2 = \langle Q_2, \Sigma, \leadsto_2 \rangle$, and let $\Sigma_\epsilon \subset \Sigma$ be a set
of labels associated with silent transitions. An ordered binary relation $\mathfrak{R}$ on $Q_1 \times Q_2$ is a weak
(observable) simulation if: (i) $\mathfrak{R}$ is total, i.e., for any $q_1 \in Q_1$ there exists $q_2 \in Q_2$ such that
$(q_1, q_2) \in \mathfrak{R}$, and (ii) for every ordered pair $(q_1, q_2) \in \mathfrak{R}$ for which there exists $q_1'$ such that
$q_1 \overset{\sigma}{\leadsto}_1 q_1'$, then $\exists (q_1', q_2') \in \mathfrak{R} : q_2 \overset{\sigma}{\leadsto}_2 q_2'$. Then $A_2$ weakly simulates $A_1$ and we write $A_2 \gtrsim A_1$.

Task specifications for hybrid systems (and transition systems, by extension) may be translated 
to a Kripke structure [36] (see [9] for examples), which is basically a SA with marked initial 
states, equipped with a labeling function that maps a state into a set of logic propositions that 
are true at that state. In this paper we also specify final states, and allow the labeling function 
to follow naturally from the semantics of the valuation map. We thus obtain a FSA $A_s =
\langle Q_s, \Sigma_s, T_s, I_s, F_s \rangle$, where $I_s$ and $F_s$ denote the subsets of initial and final states, respectively.
Given the dynamic environment, a system ($H_a$ or $A(H_a)$) satisfies the specification $A_s$ if the
interacting behavior of the system and the environment forms a word that is accepted in $A_s$.

C. Games on Semiautomata 

Here, we follow for the most part the notation and terminology of [37, Chapter 4]. Let
$A_1 = \langle Q_1, \Sigma_1, T_1 \rangle$ represent the dynamics of player 1, and $A_2 = \langle Q_2, \Sigma_2, T_2 \rangle$ those of player 2.
We define the set $I_i \subseteq Q_i$ as the set of legitimate initial states of $A_i$, for $i = 1, 2$ respectively,
but we do not specify final states in these two SA. The language admissible in $A_i$ is $\mathcal{L}(A_i) =
\bigcup_{q_0 \in I_i} \bigcup_{q \in Q_i} \{w \mid T_i(q_0, w) = q\}$, which essentially includes all possible sequences of actions
that can be taken in $A_i$. Let $A = \Sigma_1 \cup \Sigma_2$ denote the joint alphabet. Define an (infinite) game [37] $\mathcal{G}(\Phi)$ on $A$ as a set
$\Phi \subseteq A^\omega$ of infinite strings consisting of symbols from the two alphabets $\Sigma_1$ and $\Sigma_2$ taken in
turns. A play is an infinite string $w = \sigma_1\sigma_2\cdots \in A^\omega$. Players take turns with player 1 playing
$\sigma_1$ first by default. In this paper we assume that players can give up their turn and "play" a
generic (silent) symbol $\epsilon$, i.e. $\epsilon \in \Sigma_i$ and $T_i(q, \epsilon) = q$, $\forall q \in Q_i$. A pair of symbols $\sigma_{2i-1}\sigma_{2i}$
for $i = 1, \ldots$ denotes a round, with any one of the two symbols being possibly equal to $\epsilon$. We
say that player 1 wins the game if $w \in \Phi$; if not, then player 2 wins. A strategy for player
$i$ in game $\mathcal{G}(\Phi)$ is a function $S_i : A^* \to \Sigma_i$. Player 1 (2) follows strategy $S_1$ (respectively,
$S_2$) in a play $w = \sigma_1\sigma_2\cdots$ if for all $n \ge 1$, $\sigma_{2n-1} = S_1(\sigma_1\sigma_2\cdots\sigma_{2n-2})$ (respectively, $\sigma_{2n} =
S_2(\sigma_1\sigma_2\cdots\sigma_{2n-1})$). A strategy for player 1 is a winning strategy $WS_1$ if all strings $w = \sigma_1\sigma_2\cdots$
that satisfy $\sigma_{2n-1} = WS_1(\sigma_1\sigma_2\cdots\sigma_{2n-2})$, $\forall n \ge 1$, belong in $\Phi$. Winning strategies for player 2
are defined similarly. If one of the players has a winning strategy, then the game is determined.
III. Game Theoretic Approach to Planning 
A. Constructing the game 

Consider a hybrid agent having to satisfy a task specification, encoded in a FSA $A_s$. Assume
that this agent is operating in an unknown environment. In the worst case, this environment 
is controlled by an intelligent adversary who has full knowledge of the agent's capabilities. 
The adversary is trying to prevent the agent from achieving its objective. The behavior of the 
environment is still rule-based, i.e. subject to some given dynamics, although this dynamics is 
initially unknown to the agent. 

Assume that the agent has been abstracted to a SA $A_1$ (player 1) and the dynamics of the
environment is similarly expressed in another SA $A_2$ (player 2). Without loss of generality, we
assume the alphabets of $A_1$ and $A_2$ are disjoint, i.e. $\Sigma_1 \cap \Sigma_2 = \emptyset$. In this game, the agent is not
allowed to give up turns ($\epsilon \notin \Sigma_1$) but the adversary that controls the environment can do so
($\epsilon \in \Sigma_2$). For two-player turn-based games, the actions of one player may influence the options
of the other by forbidding the latter to initiate certain transitions. To capture this interaction
mechanism we define the interaction functions $U_i : Q_i \times Q_j \to 2^{\Sigma_j}$, $(i, j) \in \{(1, 2), (2, 1)\}$. An
interaction function $U_i$ maps a given pair of states $(q_i, q_j)$ of players $i$ and $j$, to the set of actions
player $j$ is not allowed to initiate at state $q_j$.

We now define a SA that abstractly captures the dynamics of interaction between the two 
players, by means of a new operation on SA which we call the turn-based product. An intersec- 
tion of the turn-based product with the task specification yields the representation of the game 
and further allows us to compute the strategy for the agent. 

Definition 7 (Turn-based product): Given two SAs for players $A_1 = \langle Q_1, \Sigma_1, T_1 \rangle$ and $A_2 =
\langle Q_2, \Sigma_2, T_2 \rangle$ with the sets of legitimate initial states $I_1$, $I_2$ and interaction functions $U_1$, $U_2$, their
turn-based product $P = \langle Q_p, \Sigma_1 \cup \Sigma_2, T_p \rangle$ is a SA denoted $A_1 \circ A_2$, and is defined as follows:

• $Q_p = Q_1 \times Q_2 \times \{0, 1\}$, where the last component is a Boolean variable $c \in \{0, 1\}$ denoting
whose turn it is to play: $c = 1$ for player 1, $c = 0$ for player 2.

• $T_p((q_1, q_2, c), \sigma) = (q_1', q_2, 0)$ if $c = 1$, $q_1' = T_1(q_1, \sigma)$, with $\sigma \notin U_2(q_2, q_1)$, and $T_p((q_1, q_2, c), \sigma)
= (q_1, q_2', 1)$ if $c = 0$, $q_2' = T_2(q_2, \sigma)$, with $\sigma \notin U_1(q_1, q_2)$.

Assuming player 1 is the first one to make a move, the set of legitimate initial states in $P$ is $I_1 \times
I_2 \times \{1\}$ and the language admissible in $P$ is $\mathcal{L}(P) = \bigcup_{q_0 \in I_1 \times I_2 \times \{1\}} \bigcup_{q \in Q_p} \{w \mid T_p(q_0, w) = q\}$,
the set of all possible plays between two players. Note that if one includes the silent action $\epsilon$ in
$\Sigma_i$ for $i = 1, 2$, the players may not necessarily play in turns, as in the specific case of
agent-environment interaction considered here. The product operation is still applicable as defined.
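
The turn-based product of Definition 7 can be sketched directly from its two transition rules. In the Python fragment below the dictionary encoding of $T_1$, $T_2$, the representation of the interaction functions as Python callables, and the robot-and-door example are all assumptions made for illustration.

```python
def turn_based_product(T1, T2, U1, U2):
    """Build T_p for the turn-based product of two semiautomata.

    T1, T2: dicts mapping (state, action) -> next state.
    U1(q1, q2): actions player 2 may NOT take; U2(q2, q1): actions player 1 may NOT take.
    Returns a dict mapping ((q1, q2, c), action) -> (q1', q2', c').
    """
    Q1 = {q for (q, _) in T1} | set(T1.values())
    Q2 = {q for (q, _) in T2} | set(T2.values())
    Tp = {}
    for q2 in Q2:                                  # c = 1: player 1 moves
        for (q1, a), q1_next in T1.items():
            if a not in U2(q2, q1):
                Tp[((q1, q2, 1), a)] = (q1_next, q2, 0)
    for q1 in Q1:                                  # c = 0: player 2 moves
        for (q2, a), q2_next in T2.items():
            if a not in U1(q1, q2):
                Tp[((q1, q2, 0), a)] = (q1, q2_next, 1)
    return Tp


if __name__ == "__main__":
    # Hypothetical example: a robot moving between rooms A and B, and a door.
    T1 = {("A", "go"): "B", ("B", "back"): "A"}
    T2 = {("open", "shut"): "closed", ("closed", "unshut"): "open"}
    U1 = lambda q1, q2: set()                                  # robot never blocks the door
    U2 = lambda q2, q1: {"go"} if q2 == "closed" else set()    # closed door blocks "go"
    Tp = turn_based_product(T1, T2, U1, U2)
    print(Tp[(("A", "open", 1), "go")])            # ('B', 'open', 0)
    print((("A", "closed", 1), "go") in Tp)        # False
```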

The turn-based product P gives snapshots of different stages in a game. It does not capture 
any of the game history that resulted in this stage. Often, task specifications encoded in As 
involve a history of actions, and thus the winning conditions for player 1 cannot be encoded in 
P by simply marking some states as final. We overcome the lack of memory in P by taking its 
product with $A_s$. Taking the product is suggested by the fact that player 1 can win the game
(i.e. agent can satisfy the specification) only if $L(A_s) \cap \mathcal{L}(P) \neq \emptyset$. The technical complication
is that the two terms in this product are heterogeneous: one is a SA and the other is a FSA. We 
resolve this by transforming the SA into a FSA and applying the standard product operation; 
and the result is what we call the game automaton. 

Definition 8 (Game automaton): The game automaton is a FSA defined as $\mathcal{G} = \mathcal{P} \times A_s =
\langle Q, \Sigma, T, Q_0, F \rangle$, where $A_s = \langle Q_s, \Sigma, T_s, I_s, F_s \rangle$ is a FSA encoding the winning conditions for
player 1, and $\mathcal{P}$ is a FSA obtained from the turn-based product $P = A_1 \circ A_2$ by defining the set of
initial states of $\mathcal{P}$ as the legitimate initial states $I_1 \times I_2 \times \{1\}$, and marking all other states as final.
The set of initial states for $\mathcal{G}$ is defined as $Q_0 = \{(q_1, q_2, 1, q_{0s}) \mid q_1 \in I_1, q_2 \in I_2, q_{0s} \in I_s\}$.
The set of final states for $\mathcal{G}$ is given by $F = \{(q_1, q_2, 0, q_s) \mid q_s \in F_s\}$.
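
One possible way to assemble the game automaton is sketched below. It is an illustration under assumptions, not the construction used in the paper: it explores only the part of the product reachable from the initial states, encodes transition functions as dictionaries, and uses the hypothetical name `game_automaton`.

```python
from collections import deque


def game_automaton(Tp, Ts, Ip, Is, Fs):
    """Reachable part of the product of a turn-based product with a specification FSA.

    Tp: dict ((q1, q2, c), action) -> (q1', q2', c');  Ts: dict (qs, action) -> qs'.
    Ip, Is: initial states of the two factors;  Fs: final states of the specification.
    Returns (Q, T, Q0, F) with states of the form (q1, q2, c, qs).
    """
    moves = {}
    for (p, a), p_next in Tp.items():
        moves.setdefault(p, []).append((a, p_next))
    Q0 = {p + (qs,) for p in Ip for qs in Is}
    Q, T, queue = set(Q0), {}, deque(Q0)
    while queue:
        q = queue.popleft()
        p, qs = q[:3], q[3]
        for a, p_next in moves.get(p, []):
            if (qs, a) in Ts:                      # both factors must allow the action
                q_next = p_next + (Ts[(qs, a)],)
                T[(q, a)] = q_next
                if q_next not in Q:
                    Q.add(q_next)
                    queue.append(q_next)
    F = {q for q in Q if q[2] == 0 and q[3] in Fs}  # final states as in Definition 8
    return Q, T, Q0, F


if __name__ == "__main__":
    Tp = {(("A", "open", 1), "go"): ("B", "open", 0),
          (("B", "open", 0), "shut"): ("B", "closed", 1)}
    Ts = {("s0", "go"): "s1", ("s1", "shut"): "s1"}
    Q, T, Q0, F = game_automaton(Tp, Ts, {("A", "open", 1)}, {"s0"}, {"s1"})
    print(sorted(Q))
    print(F)
```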

It follows (from the fact that the language of the game automaton $\mathcal{G}$ is regular) that the game defined by $\mathcal{G}$ is a
reachability game [38], and therefore it is determined. Note that the final states of $\mathcal{G}$ are exactly
those in which player 1 wins the game. On the FSA $\mathcal{G}$, we define the attractor of $F$, denoted
$\mathrm{Attr}(F)$, which is the largest set of states $W \supseteq F$ in $\mathcal{G}$ from where player 1 can force the play
into $F$. It is defined recursively as follows. Let $W_0 = F$ and set

$$W_{i+1} := W_i \cup \{q \in Q \mid q = (q_1, q_2, 1, q_s), \text{ and } \exists \sigma \in \Gamma(q) : T(q, \sigma) \in W_i\}$$
$$\qquad\quad \cup\, \{q \in Q \mid q = (q_1, q_2, 0, q_s), \text{ and } \forall \sigma \in \Gamma(q) : T(q, \sigma) \in W_i\}. \qquad (2)$$

The function $\rho : Q \to \mathbb{N}$; $\rho(q) \mapsto \min\{i \ge 0 \mid q \in W_i\}$ is called the rank function of the game.

Since $Q$ is finite, there exists the smallest $m \in \mathbb{N}$ such that $W_{m+1} = W_m$. Then $\mathrm{Attr}(F) = W_m$.
Moreover, because the game is determined, the complement of $\mathrm{Attr}(F)$ in $Q$ forms a trap for player 1;
it contains all the states at which player 2 can prevent player 1 from winning the game. $\mathrm{Attr}(F)$
can be computed in time $O(n_1 + n_2)$ where $n_1 = |Q|$ and $n_2$ is the number of transitions in $\mathcal{G}$.
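
The fixed-point iteration (2) and the rank function can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation used in the paper: the dictionary encoding of $T$, the name `attractor`, and the convention that a state with no enabled action never joins the attractor are choices made here. The loop rescans all states in every pass, so it is simpler but slower than the linear-time bound quoted above.

```python
def attractor(states, trans, final):
    """Iterate (2) to a fixed point and return (Attr(F), rank).

    states: iterable of tuples (q1, q2, c, qs); c == 1 means player 1 moves.
    trans:  dict state -> {action: successor} (hypothetical encoding of T).
    final:  the set F of winning states.
    """
    attr = set(final)                # W_0 = F
    rank = {q: 0 for q in final}     # rho(q) = 0 on F
    i = 0
    while True:
        new = set()
        for q in states:
            if q in attr:
                continue
            succ = list(trans.get(q, {}).values())
            if not succ:
                continue             # assumption: deadlocked states are skipped
            if q[2] == 1:            # player 1: some action leads into W_i
                joins = any(s in attr for s in succ)
            else:                    # player 2: every action leads into W_i
                joins = all(s in attr for s in succ)
            if joins:
                new.add(q)
        if not new:                  # W_{i+1} == W_i: fixed point reached
            return attr, rank
        i += 1
        for q in new:
            rank[q] = i
        attr |= new


# Toy game (hypothetical): a and t belong to player 1, b and f to player 2.
a = ("a", "-", 1, "s")
b = ("b", "-", 0, "s")
f = ("f", "-", 0, "done")
t = ("t", "-", 1, "s")
toy_trans = {a: {"x": b, "y": t}, b: {"u": f}, t: {"z": t}, f: {}}
toy_attr, toy_rank = attractor([a, b, f, t], toy_trans, {f})
```

In the toy game, state `t` can only loop on itself, so it stays outside the attractor and forms the trap for player 1.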

B. Computing a winning strategy 

The following statement is straightforward. 

Theorem 4: Player 1 has a winning strategy iff $\mathrm{Attr}(F) \cap Q_0 \neq \emptyset$.
Proof: If $\mathrm{Attr}(F) \cap Q_0 \neq \emptyset$, the winning strategy of player 1 can be defined as a map
$WS_1 : Q \to 2^{\Sigma_1}$ so that for $q = (q_1, q_2, 1, q_s)$, the image of this map is $WS_1(q) = \{\sigma \mid T(q, \sigma) \in
\mathrm{Attr}(F)\}$. If the game starts at $q_0 \in \mathrm{Attr}(F) \cap Q_0$, by exercising $WS_1$, player 1 ensures that
subsequent states are within its attractor. ■

We refer to $\mathrm{Attr}(F) \cap Q_0$ as the set of winning initial states of $\mathcal{G}$. Notice that strategy $WS_1$
keeps player 1 in its attractor, ensuring that it can win the game, but does not necessarily guide
it into winning. To compute an optimal winning strategy (one that wins the game for player
1 in the least number of turns) we partition $W_m$ into a set of subsets $V_i$, $i = 0, \ldots, m$ in the
following way: let $V_0 = W_0 = F$ and set $V_i := W_i \setminus W_{i-1}$, for all $i \in \{1, \ldots, m\}$. The sets $V_i$
partition the attractor into layers, according to the rank of the states that are included. That is,
$\forall q \in V_i$, $\rho(q) = i$, and thus the partition $\{V_i\}_{i=0}^{m}$ is the one induced by the rank function. We
can then prove the following sequence of statements.

Once the game is in $\mathrm{Attr}(F)$, all the actions of player 2, and some of player 1, strictly decrease
the rank function:

Lemma 1: For each $q \in V_{i+1}$, $i = 0, \ldots, m-1$, if $c = 1$, then $\exists \sigma \in \Sigma_1 \cap \Gamma(q)$ such that
$T(q, \sigma) \in \mathrm{Attr}(F)$, it is $\rho(T(q, \sigma)) = i$. If $c = 0$, then $\forall \sigma \in \Sigma_2 \cap \Gamma(q)$, such that $\rho(T(q, \sigma)) = i$.

Proof: Let $q \in V_{i+1}$. According to (2), either (a) $c = 1$ and so $T(q, \sigma) \in W_i$ for some
$\sigma \in \Gamma(q)$, or (b) $c = 0$ and $T(q, \sigma) \in W_i$, $\forall \sigma \in \Gamma(q)$. We show the argument for case (a)
when $c = 1$ by contradiction: suppose there exists $k < i$, so that $T(q, \sigma) \in V_k$ (by construction
(2) we already have $k \le i$). Then according to (2), $q$ belongs to $V_{k+1}$. But since the sets $V_i$
partition $\mathrm{Attr}(F)$, $V_{k+1}$ and $V_{i+1}$ are disjoint. Therefore $q$ cannot be in $V_{i+1}$ as assumed in the
statement of the Lemma. Thus, when $c = 1$, all actions that enable the player to remain in its
attractor in fact move it only one (rank function value) step closer to the winning set. A similar
contradiction argument applies to case (b) when $c = 0$: assume that all $\sigma \in \Sigma_2 \cap \Gamma(q)$ yield
$T(q, \sigma) \in V_j$ for some $j < i$. Let $k = \max_{q' = T(q, \sigma)} \rho(q')$. Then $i > k \ge j$, which means that
$k + 1 < i + 1$. In the same way we arrive at $q \in V_{k+1}$, which is a contradiction. ■

Informally, actions of player 1 from $V_{i+1}$ cannot take the game any closer to $F$ than $V_i$. This
implies that the rank of a state expresses the total number of turns in which player 1 can win
the game from that state.

Proposition 1: For each $q \in V_i$, there exists at least one word $w \in L(\mathcal{G})$, with $|w| = i$, such
that $T(q, w) \in F$.

Proof: We use induction, and we first prove the statement for $i = 1$. For each $q =
(q_1, q_2, 1, q_s) \in V_1$, Lemma 1 suggests that at least one action of player 1 which keeps it in
the attractor actually sends it to $V_0 = F$. So for $i = 1$ the plays in which player 1 wins
have length one. Now suppose the statement holds for $i = n$; we will show that it also holds for
$i = n + 1$. According to Lemma 1, for each $q \in V_{n+1}$, $\forall \sigma \in \Sigma_2 \cap \Gamma(q)$ (player 2 taking its
best action) or for at least one $\sigma \in \Sigma_1 \cap \Gamma(q)$ (player 1 taking its best action) we will have
$T(q, \sigma) \in V_n$. In other words, if both players play their best, the rank of the subsequent state in
the game automaton will be $n$. Inductively, we conclude the existence of a path of length $n$ in
$\mathcal{G}$ starting at $q \in V_n$ and ending in $q' \in V_0 = F$. ■

Proposition 2: Suppose $q_0 = (q_1, q_2, 1, q_{s0})$ and that $\rho(q_0) = k \le m$. Then player 1 can win
the game in at most $k$ rounds following the strategy $WS_1^*$, defined as

$$WS_1^*(q) = \{\sigma \mid T(q, \sigma) \in V_{i-1},\ q \in V_i,\ i \ge 1\}. \qquad (3)$$

Proof: Given a state $q = (q_1, q_2, 1, q_s) \in V_i$, $WS_1^*$ allows player 1 to force the game
automaton to reach a state in $V_{i-1}$ by picking action $\sigma^*$ such that $T(q, \sigma^*) = q'$ where $q' \in V_{i-1}$
(Lemma 1). At $q'$, $c = 0$. Any action of player 2 takes the game automaton to a state $q'' \in V_j$
for $j \le i - 2$. In fact, the best player 2 can do is to delay its defeat by selecting an action $\sigma$
such that $j = i - 2$ (Lemma 1). An inductive argument can now be used to complete the proof.

■
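
To make the difference between $WS_1$ and $WS_1^*$ concrete, the following Python sketch extracts the optimal strategy (3) from a precomputed rank function. It is only an illustration: the dictionary encodings of $T$ and $\rho$, the function name, and the toy game are assumptions introduced here.

```python
def optimal_strategy(trans, rank):
    """Return WS_1^*: for each player-1 state of rank i >= 1, the actions
    whose successor has rank exactly i - 1 (i.e. lies in V_{i-1}).

    trans: dict state -> {action: successor} (hypothetical encoding of T).
    rank:  dict state -> rho(state), defined only on Attr(F).
    States are tuples (q1, q2, c, qs) with c == 1 when player 1 moves.
    """
    ws = {}
    for q, r in rank.items():
        if q[2] == 1 and r >= 1:
            ws[q] = {act for act, nxt in trans.get(q, {}).items()
                     if rank.get(nxt) == r - 1}
    return ws


# Toy game (hypothetical): from a, action "x" approaches F while "y"
# stays inside the attractor but moves away from F.
a = ("a", "-", 1, "s")
b = ("b", "-", 0, "s")
f = ("f", "-", 0, "done")
g = ("g", "-", 1, "s")
toy_trans = {a: {"x": b, "y": g}, b: {"u": f}, g: {"v": a}, f: {}}
toy_rank = {f: 0, b: 1, a: 2, g: 3}
toy_ws = optimal_strategy(toy_trans, toy_rank)
```

Here action `"y"` would be permitted by $WS_1$, because it keeps the play inside the attractor, but it is excluded from $WS_1^*$ because it raises the rank instead of lowering it.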

IV. Learning through Grammatical Inference 

In Section III it was shown that the agent can accomplish its task iff (a) it has full knowledge
of the environment, and (b) the game starts at a winning initial state in $\mathrm{Attr}(F) \cap Q_0$. The
question addressed in this section is how the agent should plan its actions to accomplish its task
when the environment is (partially) unknown but rule-governed. Assuming that the language of
the environment is learnable by some GIM, we employ a grammatical inference module to
solve this problem.

A. Overview 

The theory of mind of an agent refers to the ability of the agent to infer the behavior of its
adversary and, further, its own perception of the model of the game [39], [40]. In the context of
this paper, the agent initially has no prior knowledge of the capabilities of its adversary and 
plans a strategy based on its own hypothesis for the adversary. Therefore, although the agent 
makes moves which keep it inside the hypothesized attractor, in reality these moves might take 
it outside the true attractor. Once the agent has departed its true attractor, then it is bound to fail 
since the adversary knows the true nature of the game and can always prevent the agent from 
fulfilling its task. 

An agent equipped with a GIM is able to construct an increasingly accurate model of
the behavior of its adversary through subsequent games (Fig. 1(b)). The expected result is that
as the agent refines the model it has for its environment and updates its "theory of mind," its 
planning efficacy increases. We expect that after a sufficient number of games, the agent should 
be able to devise strategies that enable it to fulfill its task irrespective of how the adversary 
proceeds. This section presents the algorithms for constructing and updating this model. 

B. Assumptions and Scope 

In the agent-environment game, the behavior of the unknown environment becomes a positive 
presentation for the learner. The hypothesis obtained by the learner is used for the agent to 
recompute the game automaton and the attractor as described in Section III. It is therefore 
guaranteed that the agent's hypothesis of the unknown environment will eventually converge to 
the true abstract model of the environment, provided that (i) the true model lies within the class of 
models inferable by the learner from a positive presentation, and (ii) the unknown environment's 
behavior suffices for a correct inference to be made (for example if a characteristic sample for 
the target language is observed). 

We make the following assumption on the structure of the unknown discrete dynamics of the
adversarial environment:

Assumption 1: The language admissible in the SA $A_2$ of the adversarial environment (player
2) is identifiable in the limit from positive presentation.

Although the results we present extend to general classes of systems generating string extension
languages, for clarity of presentation we will focus the remaining discussion on a particular
subclass of string extension languages, namely Strictly $k$-Local languages (SL$_k$) [2v], which has
been defined in Section II-A.

C. Identifying the Class of the Adversary's Behavior

As suggested by Theorem 2, in order to identify the behavior of the adversary, which is
expressed in the form of a language, the agent must know whether this language is SL and, if it is,
for which $k$ in the SL hierarchy. We assume this information is provided to the agent before the game
starts. We employ the algorithm in [41], adapted for SA, to check whether a given SA admits a
SL language.* In what follows we provide a method for determining the natural number $k$:

For some $k > 0$, consider a (non)-canonical FSA that accepts $\Sigma^*$: $\mathcal{D}_k = (Q_D, \Sigma, T_D, \{\lambda\}, F_D)$,
where (i) $Q_D = \mathrm{Pr}^{\le k-1}(\Sigma^*)$; (ii) $T_D(u, a) = \mathrm{Sf}^{=k-1}(ua)$ iff $|ua| \ge k - 1$ and $ua$ otherwise;
(iii) $\lambda$ is the initial state, and (iv) $F_D = Q_D$ is the set of final states (all states are final).
We refer to $\mathcal{D}_k$ as the SL$_k$-FSA for $\Sigma^*$. It is shown [42] that for a given SL$_k$ language
with grammar $\mathfrak{G}$, a (non)-canonical FSA accepting $L(\mathfrak{G})$ can be obtained by removing some
transitions and the finality of some of the states† in $\mathcal{D}_k$. We call the FSA of a SL$_k$ language
$L(\mathfrak{G})$ obtained in this way the SL$_k$-FSA of $L(\mathfrak{G})$. Figure 2(a) shows a SL$_3$-FSA for $\Sigma^*$, with
$\Sigma = \{a, b\}$. Figure 2(b) shows another SL$_3$-FSA, one that accepts the language given by the
string extension grammar $\mathfrak{G} = \{\rtimes aa, \rtimes ab, aab, aaa, aba, ba\ltimes\}$. For example, $aaba \in L(\mathfrak{G})$
because $f_3(\rtimes aaba \ltimes) = \{\rtimes aa, aab, aba, ba\ltimes\} \subseteq \mathfrak{G}$. Yet $aababa \notin L(\mathfrak{G})$, as
$f_3(\rtimes aababa \ltimes) = \{\rtimes aa, aab, aba, bab, ba\ltimes\} \not\subseteq \mathfrak{G}$; in fact the 3-factor $bab \notin \mathfrak{G}$.
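
The two membership checks above can be reproduced with a few lines of Python. This is a sketch for illustration only: the function names are hypothetical, and the treatment of strings shorter than $k$ (the marked string itself is returned as the only factor) is the usual convention for string extension grammars, assumed here to match the definition in Section II-A.

```python
LEFT, RIGHT = "\u22ca", "\u22c9"  # word boundary markers (left and right)


def k_factors(word, k):
    """Set of length-k substrings of the boundary-marked word."""
    marked = LEFT + word + RIGHT
    if len(marked) < k:
        return {marked}  # assumed convention for short strings
    return {marked[i:i + k] for i in range(len(marked) - k + 1)}


def in_language(word, grammar, k):
    """A word is in L(grammar) iff all of its k-factors are permitted."""
    return k_factors(word, k) <= grammar


# The SL_3 grammar of the running example.
G = {LEFT + "aa", LEFT + "ab", "aab", "aaa", "aba", "ba" + RIGHT}

accepted = in_language("aaba", G, 3)       # all four 3-factors are in G
rejected = in_language("aababa", G, 3)     # the 3-factor "bab" is not in G
offending = k_factors("aababa", 3) - G
```

The set difference `offending` isolates the forbidden factor, which is the same diagnosis given in the text.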

*The algorithm of [41] works with the graph representation of a FSA and therefore it is not necessary to designate the initial states.
†Removing finality of a state $q$ in FSA $A$ means to remove $q$ from the set of final states in $A$.
Fig. 2: The (non)-canonical FSA $\mathcal{D}_3$ accepting $\Sigma^*$ for $\Sigma = \{a, b\}$ (left) and the SL$_3$-FSA obtained for $L(\mathfrak{G})$, where
$\mathfrak{G} = \{\rtimes aa, \rtimes ab, aab, aaa, aba, ba\ltimes\}$, after removing transitions and the finality of some states (right).

In a FSA, we say $q \in Q$ is at level $i$ iff $i = \min\{|w| \mid w \in \Sigma^*, T(q_0, w) = q\}$, where $q_0$ is an
initial state. The function $\gamma : Q \to \mathbb{N}$ maps a state $q$ to its level. Now we can state the following.

Lemma 2: If a canonical FSA $C = (Q_c, \Sigma, T_c, q_{0c}, F_c)$ accepts a SL language $L$ for some $k$,
where $k$ is the smallest number such that $L(C) \in \mathrm{SL}_k$, then $k \le \max_{q \in Q_c} \gamma(q) + 1$.

Proof: Let $\mathfrak{G}$ be a SL$_k$ grammar that generates $L$. Then we can generate a (non)-canonical
FSA $B = (Q_b, \Sigma, T_b, \{\lambda\}, F_b)$ by removing transitions and finality of nodes from $\mathcal{D}_k$. Let $q^* =
\arg\max_{q \in Q_c} \gamma(q)$ be a state in $C$ furthest from the initial state, let $n = \gamma(q^*)$ be its level, and
$w = w_1 w_2 \cdots w_n$ be a word that brings $C$ to state $q^* = T_c(q_{0c}, w)$. FSAs $B$ and $C$ accept the
same languages, so $w \in L(C)$ iff $w \in L(B)$. In $B$, however, we can compute a $k$, because
$T_b(\lambda, w) = \mathrm{Sf}^{=k-1}(w) \in F_b$ with $k - 1 \le n$, i.e. $k \le n + 1$. ■
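
The bound of Lemma 2 only needs the level function $\gamma$, which a breadth-first search delivers directly. The sketch below is an illustration under assumptions: the transition-dictionary encoding and the four-state example FSA (accepting strings of the form $a \cdots aba$) are hypothetical choices made here, not taken from the paper's figures.

```python
from collections import deque


def levels(trans, q0):
    """Breadth-first search: level[q] is the length of a shortest word
    that drives the FSA from q0 to q."""
    level = {q0: 0}
    queue = deque([q0])
    while queue:
        q = queue.popleft()
        for nxt in trans.get(q, {}).values():
            if nxt not in level:
                level[nxt] = level[q] + 1
                queue.append(nxt)
    return level


def k_upper_bound(trans, q0):
    """Upper bound on the smallest k from Lemma 2: max level + 1."""
    return max(levels(trans, q0).values()) + 1


# Hypothetical deterministic FSA with initial state 0 and final state 3.
example = {0: {"a": 1}, 1: {"a": 1, "b": 2}, 2: {"a": 3}, 3: {}}
example_levels = levels(example, 0)
example_bound = k_upper_bound(example, 0)
```

For this example the bound equals the number of states, which is the worst case mentioned in the discussion of the bound.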

Though we can only obtain an upper bound $k_{\max} = \max_{q \in Q_c} \gamma(q) + 1$ on the smallest $k$ (in the
worst case this bound is $|Q_c|$), the hierarchy of the SL language class given by Theorem 3 guarantees
that this upper bound $k_{\max}$ is sufficient for us to obtain a correct SL$_{k_{\max}}$ grammar that generates
the exact language presented to the learner, irrespective of whether this language can also be generated
by a SL$_k$ grammar for some $k < k_{\max}$. For example, for the language accepted by the FSA in
Fig. 2(b), we can also obtain a SL$_4$ grammar $\mathfrak{G}' = \{\rtimes aaa, \rtimes aba, \rtimes aab, aaaa, aaba, aaab, aba\ltimes\}$ and
it can be verified that $L(\mathfrak{G}') = L(\mathfrak{G})$.

D. Learning the Adversary's Dynamics

Before the game starts, player 1 is informed that the behavior of its adversary is a SL$_k$
language for some known $k$ and that the adversary can always give up a turn, i.e. $\epsilon \in \Sigma_2$. With this
knowledge, player 1 builds a SL$_k$-FSA for $(\Sigma_2 \setminus \{\epsilon\})^*$. Then, by unmarking initial and final
states and adding a self-loop labeled $\epsilon$ at each state, it obtains an initial model of its adversary, $A_2^{(0)}$.

In the course of the game, player 1 (agent) records the continuous sequence of actions of player 2
(the environment). This amounts to a presentation $\phi$ of the form: $\phi(0) = \lambda$, $\phi(i + 1) = \phi(i)\sigma$, $i \ge 1$,
$i \in \mathbb{N}$, for some $\sigma \in \Gamma(T(q_0, w)) \cap \Sigma_2 \neq \emptyset$ where $q_0 \in Q_0$ and $w{\restriction}_{\Sigma_2} = \phi(i)$.* The learning
algorithm is applied by player 1 to generate and refine the hypothesized model of its adversary
from the presentation $\phi$.
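
For intuition, a generic string extension learner for SL$_k$ languages can be sketched as below: the hypothesis after each observation is simply the union of the $k$-factors seen so far. This is an illustration under assumptions, not the paper's learner, which works on a semiautomaton and a continuous action sequence; the function names and the sample presentation of complete strings are hypothetical.

```python
LEFT, RIGHT = "\u22ca", "\u22c9"  # word boundary markers


def k_factors(word, k):
    """Set of length-k substrings of the boundary-marked word."""
    marked = LEFT + word + RIGHT
    if len(marked) < k:
        return {marked}  # assumed convention for short strings
    return {marked[i:i + k] for i in range(len(marked) - k + 1)}


def learn(presentation, k):
    """Return the sequence of hypothesized grammars, one per observation.
    Each hypothesis only ever grows, so the learner never retracts a factor."""
    grammar = set()
    hypotheses = []
    for word in presentation:
        grammar |= k_factors(word, k)
        hypotheses.append(frozenset(grammar))
    return hypotheses


# Hypothetical positive presentation drawn from the SL_3 example language.
hyps = learn(["aba", "aba", "aaaba"], 3)
target = {LEFT + "aa", LEFT + "ab", "aab", "aaa", "aba", "ba" + RIGHT}
```

Repeating an already seen string leaves the hypothesis unchanged, and once the observed strings cover every permitted factor the hypothesis equals the target grammar and stays there.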

Since a FSA for any SL$_k$ grammar can be generated by removing edges and finality of nodes
in the SL$_k$-FSA for $\Sigma^*$, the SA for player 2 can be obtained by just removing edges in
$A_2^{(0)}$. Due to this special property, we can use an instrument with which the agent encodes
new knowledge into the hypothesized model for the adversary, namely, a switching function sw,
which operates on a SA (or FSA) and either blocks or allows certain transitions to take place:
$\mathrm{sw} : Q_2 \times \Sigma_2 \to \{0, 1\}$, so that for $q \in Q_2$, $\sigma \in \Gamma(q)$ only if $\mathrm{sw}(q, \sigma) = 1$. Consequently, at
round $i + 1$, the incorporation of new knowledge for $A_2$ obtained at round $i$ redefines sw. We
assume a naive agent that starts its interaction with the environment believing that the latter is
static (has no dynamics). That hypothesis corresponds to having $\mathrm{sw}^{(0)}(q, \sigma) = 0$, $\forall \sigma \in \Sigma_2 \setminus \{\epsilon\}$,
and $\mathrm{sw}^{(0)}(q, \epsilon) = 1$, $\forall q \in Q_2$.

Note that $\phi(i)$ denotes the presentation up to round $i$. The initialization of the game can be
considered as a single round played blindly by both players (without any strategy). Hence, if the
game starts with $((q_1, q_2, 1), q_{0s})$, it is equivalent to have $\phi(1) = \sigma$, for which $T_2(\lambda, \sigma) = q_2$. Let
$\mathrm{sw}^{(i)}$ denote the refinement of sw made at round $i$, and suppose that at round $i + 1$ the adversary
plays $\sigma'$. This suggests $\phi(i + 1) = \phi(i)\sigma'$. Suppose $q_2 = T_2(\lambda, \phi(i))$; then for all $q \in Q_2$ and
$\sigma \in \Sigma_2$, $\mathrm{sw}^{(i+1)}$ is defined by

$$\mathrm{sw}^{(i+1)}(q, \sigma) = \begin{cases} \mathrm{sw}^{(i)}(q, \sigma) & \text{if } (q, \sigma) \neq (q_2, \sigma') \\ 1 & \text{if } (q, \sigma) = (q_2, \sigma') \end{cases} \qquad (4)$$

meaning that the transition from $q_2$ on input $\sigma'$ in $A_2$ is now enabled. With a small abuse of
notation, we denote the pair $(A_2^{(0)}, \mathrm{sw}^{(i)}) = A_2^{(i)}$, read as the SA $A_2^{(0)}$ with switching function
$\mathrm{sw}^{(i)}$. Pictorially, $A_2^{(i)}$ is the SA obtained from $A_2^{(0)}$ by trimming the set of transitions which are
switched off ($\mathrm{sw}^{(i)}(\cdot) = 0$).
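
The refinement rule (4) amounts to flipping a single entry of the switching function. The Python sketch below illustrates this for a SL$_k$ model whose states are the last $k - 1$ observed symbols. Everything about the encoding is an assumption made for illustration: the sparse dictionary with default values standing in for $\mathrm{sw}^{(0)}$, the string `"eps"` for the silent action, and the function names.

```python
EPS = "eps"  # hypothetical encoding of the silent action


def run(history, k):
    """State T_2(lambda, history) of the SL_k model: the last k-1
    non-silent symbols observed (fewer if the history is short)."""
    seen = tuple(s for s in history if s != EPS)
    return seen[max(0, len(seen) - (k - 1)):]


def sw_value(sw, state, symbol):
    """Look up sw; entries never refined keep their sw^(0) value:
    1 for the silent self-loop, 0 for every other transition."""
    return sw.get((state, symbol), 1 if symbol == EPS else 0)


def refine(sw, history, observed, k):
    """Equation (4): enable the pair (q2, observed), keep all other entries.
    Returns a new dictionary and leaves the old one untouched."""
    q2 = run(history, k)
    new_sw = dict(sw)
    new_sw[(q2, observed)] = 1
    return new_sw


# Door pairs are used as adversary symbols, as in the case study.
sw0 = {}
sw1 = refine(sw0, ["ad"], "ae", k=2)
sw2 = refine(sw1, ["ad", "ae"], "ce", k=2)
```

Because only one entry changes per observation, an update costs constant time regardless of the size of the hypothesized model.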

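*The projection ${\restriction}_{\Sigma_2} : \Sigma^* \to \Sigma_2^*$ is a map whose image $w{\restriction}_{\Sigma_2}$ is the string obtained by removing all symbols in $w$ which are not in $\Sigma_2$.

Correspondingly, the game automaton in the initial theory of mind of the agent is constructed
as $\mathcal{G}^{(0)} = \mathcal{P}^{(0)} \times A_s$, where $\mathcal{P}^{(0)}$ is the FSA obtained from $P^{(0)} = A_1 \circ A_2^{(0)}$ after setting $I_1 \times I_2 \times \{1\}$
as the set of legitimate initial states, where $I_2 = \{q \mid T_2(\lambda, \sigma) = q, \sigma \in \Sigma_2 \setminus \{\epsilon\}\}$, and all other
states in $P^{(0)}$ as final. By the construction of the game, the switching function associated with $A_2^{(i)}$
can be extended naturally to $\mathcal{G}^{(i)} = (\mathcal{G}^{(0)}, \mathrm{sw}^{(i)})$ by:

$$\forall q = (q_1, q_2, 0, q_s),\ \sigma \in \Sigma_2:\quad \mathrm{sw}^{(i)}(q, \sigma) = 1 \text{ (or 0) in } \mathcal{G}^{(i)} \text{ iff } \mathrm{sw}^{(i)}(q_2, \sigma) = 1 \text{ (or 0) in } A_2^{(i)}. \qquad (5)$$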

With the extension of the switching function, one is able to update the game automaton without
computing any product during runtime. This is because the structure of the game has essentially
been pre-compiled. This results in significant computational savings during runtime, depending
on the size of $A_2^{(0)}$.

This switching mechanism, along with the extension from $A_2^{(i)}$ to $\mathcal{G}^{(i)}$, can be applied to other
classes of string extension languages, in particular any class of languages describable with FSAs
obtainable by removing edges and finality of states from some deterministic FSA accepting $\Sigma^*$.

E. Symbolic Planning and Control 

With the theory of mind as developed in round $i$, and with the game automaton at state $q$,
the agent computes an optimal winning strategy $WS_1^*$ based on (3), by setting $W_0 = V_0 = F$
and iteratively evaluating (2), where $\mathrm{sw}^{(i)}$ defined in $\mathcal{G}^{(i)}$ has to be taken into account: for all
$(q, \sigma) \in Q \times \Sigma$, if $\mathrm{sw}^{(i)}(q, \sigma) = 0$, then $\sigma \notin \Gamma(q)$. The computation terminates when the
following condition is satisfied:

$$\exists m \in \mathbb{N} : \ q \in W_m \ \lor \ q \notin W_m = W_{m+1}. \qquad (6)$$

When $q \in W_m$, $WS_1^*$ can be computed at $q$. Then, based on Proposition 2, the strategy ensures
victory in at most $m$ turns. The agent implements this strategy as long as its theory of mind
for the adversary remains valid, in other words, as long as no new transition has been switched on. In the
absence of new information, the plan computed is optimal and there is no need for adjustment. If
in the course of the game an action of the adversary that the current model cannot predict is
observed, then that model is refined as described in Section IV-D. Once the new game automaton
is available, (2)-(3) are recomputed until (6) is satisfied.

If instead $q \notin W_m = W_{m+1}$, then the agent thinks that $q \in \mathrm{Attr}(F)^c$: the agent is in the trap
of its adversary. If the adversary plays its best, the game is lost. It should be noted that this
attractor is computed on the hypothesized game and may not be the true attractor. Assuming that
the adversary will indeed play optimally, the agent loses its confidence in winning and resigns. In
our implementation, when the agent resigns the game is restarted at a random initial state $q_0 \in Q_0$,
but with the agent retaining the knowledge it has previously obtained about its adversary. The
guaranteed asymptotic convergence of a string extension learner ensures that in each subsequent
game, the agent increases its chances of winning when initialized at configurations from which
winning strategies exist. The adversary can always choose to prevent the agent from learning by
not providing new information, but by doing so it compromises its own strategy.

The following section illustrates how the methodology outlined can be implemented on a 
simple case study, and demonstrates the effectiveness of the combination of planning with string 
extension learning. As it turns out, the identification of the adversary's dynamics is quite efficient 
in relation to the size of A2. 

V. Refinement on Hybrid Dynamics 

Section IV established a methodology based on which the agent can concurrently learn and 
(re)plan an optimal strategy for achieving its objective, in a partially known and adversarial 
environment. This section addresses the problem of implementing the optimal strategy on the 
concrete dynamics of the hybrid agent Ha as given in Definition 3. 

Proposition 3: Every transition labeled with $\tau \in \Sigma \setminus \Sigma_a$ must be followed by a transition
labeled with some $\sigma \in \Sigma_a$, i.e., every silent transition in $A(H_a)$ must be followed by an
observable one.

Proof: Assume, without loss of generality, that the $\tau$ transition appears somewhere between
two observable transitions $\sigma_1, \sigma_2 \in \Sigma_a$. We will show that $\tau$ is the only silent transition that
can "fit" between $\sigma_1$ and $\sigma_2$; in other words, we can only have $q \xrightarrow{\sigma_1} q_1 \xrightarrow{\tau} q_2 \xrightarrow{\sigma_2} q'$ for some
$q$, $q_1$, $q_2$, and $q' \in Q$. For that, note that by definition, $q$ must be such that for all $(z, p)$
giving $V_{\mathcal{M}}(z, p) = q$, $(z, p) \models \mathrm{Pre}(\sigma_1)$; similarly $q_1$ must be such that for all $(z', p)$ giving
$V_{\mathcal{M}}(z', p) = q_1$ we should have $(z', p) \models \mathrm{Post}(\sigma_1)$. Now suppose that there is another silent
transition $\tau'$, in addition to $\tau$, between $\sigma_1$ and $\sigma_2$, and for the sake of argument assume that
it comes right after $\tau$: $q \xrightarrow{\sigma_1} q_1 \xrightarrow{\tau} q'' \xrightarrow{\tau'} q''' \cdots q_2 \xrightarrow{\sigma_2} q'$. With the $\tau$ transition following $\sigma_1$
we have by definition that there exists a $p'$ such that once the $\tau$ transition is completed it is
$(z', p') \models \mathrm{Pre}(\sigma')$ for some $\sigma' \in \Sigma_a$. Since $(z', p) \models \mathrm{Post}(\sigma_1)$ and $(z', p') \models \mathrm{Pre}(\sigma')$, we have
by Definition 3 that $H_a$ makes a transition from $(z', p', \sigma_1)$ to $(z', p', \sigma')$, and then the continuous
component dynamics $f_{\sigma'}$ is activated, yielding $z' \xrightarrow{\sigma'[p']} z''$ for some $(z'', p') \models \mathrm{Post}(\sigma')$. This
time, with $(z', p') \models \mathrm{Pre}(\sigma')$ and $(z'', p') \models \mathrm{Post}(\sigma')$, it follows that there is a $\sigma'$ transition
in $A(H_a)$ taking $q'' \xrightarrow{\sigma'} q'$, and $\sigma' = \sigma_2$ because there cannot be more than two observable
transitions between $q$ and $q'$ by assumption. Therefore, $\tau$ is the only silent transition that must
have occurred while $A(H_a)$ moved from $q$ to $q'$. ■

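Due to Proposition 3, without loss of generality we will assume that a composite transition
consists of a silent transition followed by an observable transition: $q \overset{\sigma}{\leadsto} q' \iff q \xrightarrow{\tau} q'' \xrightarrow{\sigma} q'$.

Theorem 5: With $\Sigma \setminus \Sigma_a$ as the set of silent labels, the hybrid agent $H_a$ weakly simulates its induced semiautomaton
$A(H_a)$ ($H_a \gtrsim A(H_a)$) in the sense that there exists an ordered total binary relation $\mathfrak{R}$ such that
whenever $(q, z) \in \mathfrak{R}$ and $q \overset{\sigma}{\leadsto} q'$ for some $q' \in Q$, then $\exists z' \in \mathcal{Z} : z \xrightarrow{\sigma[p]} z'$ such that $(q', z') \in \mathfrak{R}$.

Proof: If $(q, z) \in \mathfrak{R}$ then there exists $p^0 \in \mathcal{P}$ such that $V_{\mathcal{M}}(z, p^0) = q$. In general, $p^0 \neq p$.
Using the convention adopted above for the composite transition, we write $q \overset{\sigma}{\leadsto} q' \iff q \xrightarrow{\tau}
q'' \xrightarrow{\sigma} q'$ with $\sigma \in \Sigma_a$ and $\tau \in \Sigma \setminus \Sigma_a$. The transition $q \xrightarrow{\tau} q''$, by definition, implies that for
all $z$ such that $V_{\mathcal{M}}(z, p^0) = q$, there exist $p \in s(z, p^0)$ and $\sigma' \in \Sigma_a$ such that $V_{\mathcal{M}}(z, p) = q''$
with $(z, p) \models \mathrm{Pre}(\sigma')$. With $q'' \xrightarrow{\sigma} q'$ assumed, we have by definition that for all $z$ such that
$V_{\mathcal{M}}(z, p) = q''$ it should be $V_{\mathcal{M}}(z', p) = q'$ for all $z'$ satisfying $(z', p) \models \mathrm{Post}(\sigma)$. (Note that this
is the same $p \in s(z, p^0)$ that appeared before, because there can only be one silent transition
before an observable one and only silent transitions change the parameters.) From Definition 3
we then have that $z \xrightarrow{\sigma[p]} z'$, and $(q', z') \in \mathfrak{R}$ because $V_{\mathcal{M}}(z', p) = q'$. ■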
We have thus shown that whatever sequence of labels is observed in a run of $A(H_a)$, a
succession of continuous component dynamics with this same sequence of subscript indices can
be activated in $H_a$. Thus, whatever strategy is devised in $A(H_a)$ has a guaranteed implementation
in the concrete dynamics of the hybrid agent. The issue of selecting the parameters so that the
implementation is realized is not treated here. This subject is addressed, using slightly different
discrete models, in [43].

VI. Case Study

A. Experimental Setup 

To demonstrate the efficacy of our methodology, we consider a game, played between a robot 
and an intelligent adversary. The purpose of the robot (hybrid agent) is to visit all four rooms in 
the triangular "apartment" configuration of Fig. 3. The four rooms in this triangular apartment 
are connected through six doors, which an intelligent adversary can close almost at will, trying 
to prevent the robot from achieving its goal. Table I shows three possible rule regimes that the 
adversary could use. Initially the robot is capable of distinguishing closed from open doors, but 
it does not know which doors can be closed simultaneously. In fact, it assumes that only the 
initially closed doors are ones that can be closed. 

Rules      Description

Opposite   Only one pair of doors opposite to each other can be closed at any time:
           {a,d}, {a,e}, {a,f}, {b,f}, {c,e}, {e,f}

Adjacent   Only one pair of doors adjacent to each other can be closed at any time:
           {a,b}, {a,c}, {b,c}, {b,d}, {b,e}, {c,d}, {c,f}, {d,e}, {d,f}

General    Any pair of doors can be closed at any time.

TABLE I: Some possible rules for the adversary (controlling the doors): at each round, the environment either
remains static or opens exactly one door in the closed pair of doors and closes exactly one, which results in another
pair of closed doors.
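
The rule regimes of Table I are small enough to enumerate. The Python sketch below encodes the Opposite and Adjacent pairs as listed in the table, checks that together they cover all pairs of the six doors, and computes which pairs the adversary can reach in one round. The function name `moves` and the reading of a round (keep one closed door, swap the other) are an interpretation of the caption assumed here for illustration.

```python
from itertools import combinations

DOORS = "abcdef"
ALL_PAIRS = {frozenset(p) for p in combinations(DOORS, 2)}
OPPOSITE = {frozenset(p) for p in ("ad", "ae", "af", "bf", "ce", "ef")}
ADJACENT = {frozenset(p) for p in
            ("ab", "ac", "bc", "bd", "be", "cd", "cf", "de", "df")}


def moves(closed, allowed):
    """Pairs reachable in one round by opening exactly one door of the
    closed pair and closing exactly one other door, such that the new
    pair is permitted. The option of remaining static is not included."""
    result = set()
    for kept in closed:
        for new in DOORS:
            if new in closed:
                continue
            pair = frozenset((kept, new))
            if pair in allowed:
                result.add(pair)
    return result


from_ad = moves(frozenset("ad"), OPPOSITE)
from_ae = moves(frozenset("ae"), OPPOSITE)
from_ce = moves(frozenset("ce"), OPPOSITE)
```

Under the Opposite regime the closed pair {a,d} can only change to {a,e} or {a,f}, which shows how strongly the rule constrains the adversary.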

The Khepera II, manufactured by K-Team Inc., is a differential-drive mobile robot, with
two actuated wheels and kinematics that are accurately represented by the equations of a
unicycle. Motion control is achieved through PID loops that independently control either angular
displacement or speed of the two wheels. These PID loops can support the development of
mid-level motion planning controllers. For example, input-output feedback linearization of the
unicycle dynamics [ ] leads to a fully actuated reduced system of the form $\dot{q} = u$, where the
sequential composition flow-through approach of [45] can be applied to produce controllers that
steer the robot from room $i$ to a neighboring room $j$. This same approach has been used in [46]
to generate discrete abstractions for the purpose of finding Waldo; details on how the sequential
composition approach can give rise to finite state automata abstractions are found in [47].

(a) The triangle room game representation. (b) A physical implementation of the game.



Fig. 3: The non-cooperative game used in this case study. Figure 3(a) is a graphical depiction of the triangular 
apartment game, while Fig. 3(b) shows a physical realization of the scenario, with a Khepera II miniature 
mobile robot in the role of the hybrid agent. The robot localizes itself and observes which doors are closed (door 
closure implemented manually using the yellow caution cones) through a VICON™ motion capture system. The 
grammatical inference module and the strategy computation algorithm have been implemented in python, which 
communicates with the control for the robot (through Matlab™) over a serial link. 

For the case at hand, we can use the flow-through strategies to generate potential field-based
velocity controllers to realize transitions from room $i$ to room $j$ in a way compatible with the
requirements on the continuous dynamics of the hybrid agent of Definition 3, that is, ensure that
$\mathrm{Pre}(\sigma)$ is positively invariant for $f_\sigma$, and that trajectories converge to $L^+(p, \sigma) \oplus \mathcal{B}_\epsilon$ in finite
time (see [47]). The latter set is in fact the formula for $\mathrm{Post}(\sigma)$: $x \in L^+(p, \sigma) \oplus \mathcal{B}_\epsilon$.

In the context of the flow-through navigation strategy of [45], a transition from, say, room 1
to room 2 (see Fig. 3) would involve a flow-through vector field [45] by which the robot exits
the polygon outlining room 1 from the edge corresponding to door $a$. (Slightly more sophisticated
behavior can be produced by concatenating the flow-through policy with a convergent [45] one
that "centers" the robot in room 2.)

The hybrid agent that is obtained by equipping the robot with these flow-through policies can
be defined as a tuple $H_a = (\mathcal{Z}, \Sigma_a, \iota, \mathcal{P}, \pi_i, \mathcal{AP}, f_\sigma, \mathrm{Pre}, \mathrm{Post}, s, T_a)$ where

• $\mathcal{Z}$ is the triangular sector of $\mathbb{R}^2$ consisting of the union of the areas of the four rooms.

• $\Sigma_a = \{(1,2), (1,3), (1,4), (2,1), (2,3), (2,4), (3,1), (3,2), (3,4), (4,1), (4,2), (4,3)\}$,
with each element associated with a single flow-through policy: $(i, j)$ denotes a flow-through
policy from room $i$ to room $j$.

• $\iota : \Sigma_a \to \{1, 2, 3, 4\}$, where we slightly abuse notation and define $\iota$ not as a bijection
but rather as a surjection, where we abstract away the room of origin and maintain the
destination, for simplicity.

• $\pi_i = \pi = I$ (the identity), $\mathcal{P} = \mathcal{Z}$, and $s(z, p) = \mathcal{P}$, $\forall (z, p) \in \mathcal{Z} \times \mathcal{P}$; in this case we do
not have to use parameters explicitly, since they are hard-wired in the flow-through policies.

• $\mathcal{AP} = \{\alpha_i : \text{robot in room } i\}$, $i = 1, 2, 3, 4$.

• $f_\sigma = K(X_\sigma - \dot{q})$, $K > 0$, a simple proportional controller on velocity intended to align the
system's vector field with the flow-through field $X_\sigma$.

• $\mathrm{Pre}((i, \cdot)) = \alpha_i$, $i \in \{1, \ldots, 4\}$ and $\mathrm{Post}((\cdot, j)) = \alpha_j$, $j \in \{1, \ldots, 4\}$.

• $T_a$ following Definition 3, once all other components are defined.

One can verify by inspection when constructing $A(H_a)$ that the first element of $\sigma = (i, j)$
is encoded in the label for the discrete state, $\alpha_i$, from which the transition $\alpha_i \xrightarrow{(i,j)} \alpha_j$ originates. Thus, to
simplify notation, we change the label of a state from $\alpha_i$ to $i$, and the label of the transition from
$(i, j)$ to just $j$, the destination state. We write $i \xrightarrow{j} j$ instead. Figure 4 (left) gives a graphical
representation of $A(H_a)$ after the state/transition relabeling, basically expressing the fact that
with all doors open, the robot can move from any room to any other room by initiating the
appropriate flow-through policy.

B. Results 

Suppose the adversarial environment adheres to the Opposite rule in Table I. The SA $A_1$ for
the agent (player 1) and a fragment of SA $A_2$ modeling the environment (player 2) are shown in
Fig. 4.* By assigning $I_1 = Q_1$ and $I_2 = Q_2$, the game can start with any state in $Q_1 \times Q_2 \times \{1\}$.

The goal of the agent in this example is to visit all four rooms (in any order). Therefore,
the specification can be described by the union of shuffle ideals of the permutations of 1234.
In this special case, since the robot occupies one room when the game starts, $A_s = (Q_s, \Sigma_s =
\Sigma_1 \cup \Sigma_2, T_s, I_s = \{1, 2, 3, 4\}, F_s = \{1234\})$. A fragment of $A_s$ is shown in Fig. 5.
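
The specification automaton only has to remember which rooms have been visited so far, which is why its states carry labels such as 14 or 124. A minimal Python sketch of this bookkeeping follows. It is an illustration under assumptions: the function names, the encoding of states as sorted digit strings, and the sample play are hypothetical, and adversary symbols (door pairs) are treated as self-loops.

```python
ROOMS = {1, 2, 3, 4}


def spec_step(visited, symbol):
    """Transition of the specification automaton: a room symbol adds that
    room to the visited set; any other symbol (an adversary move) is a
    self-loop."""
    if symbol in ROOMS:
        return "".join(sorted(set(visited) | {str(symbol)}))
    return visited


def run_spec(start_room, play):
    """Return the sequence of specification states along a play.
    The initial state is the room the robot occupies at the start."""
    state = str(start_room)
    trace = [state]
    for symbol in play:
        state = spec_step(state, symbol)
        trace.append(state)
    return trace


# Hypothetical play alternating robot moves and adversary door changes.
trace = run_spec(1, [4, "ae", 2, "ce", 1, "ef", 3])
task_done = trace[-1] == "1234"
```

Revisiting room 1 leaves the state at 124, so only the first visit to each room advances the specification.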

The interaction functions follow from obvious physical constraints: when the environment
adversary closes a door, the agent cannot then move through it. The interaction function $U_2(d_1 d_2, r)$
gives the set of rooms the agent cannot access from room $r$ because doors $d_1$ and $d_2$ are closed.

*SAs $A_1$ and $A_2$ happen to be Myhill graphs, but the analysis presented applies to general SAs.

Fig. 4: Semiautomata for the agent (left) and for a fragment of the environment (right). In $A_1$, the states are the
rooms and the transitions are labeled with the rooms that the agent is to enter. For $A_2$, the states represent the pairs
of doors that are currently closed and a transition $xy$ indicates that the pair of doors $x, y$ are to be closed.

Fig. 5: Fragment of $A_s = (Q_s, \Sigma_s = \Sigma_1 \cup \Sigma_2, T_s, I_s = \{1, 2, 3, 4\}, F_s = \{1234\})$, where $x = \Sigma_2$.

In Fig. 3(b), for instance, $U_2(ab, 1) = \{2, 3\}$. In this example, the agent cannot enforce any
constraints on the adversary's behavior, so $U_1(q) = \emptyset$, $\forall q \in Q_1 \times Q_2$. Figure 6 shows a fragment
of $A_1 \circ A_2$, while a fragment of the game automaton $\mathcal{G}$ is shown in Fig. 7.



Fig. 6: Fragment of turn-based product $P = A_1 \circ A_2 = (Q_p, \Sigma_1 \cup \Sigma_2, T_p)$. State $(r, d_1 d_2, c)$ means the agent is in
room $r$, doors $\{d_1, d_2\}$ are closed and the Boolean variable keeping track of whose turn it is is set to $c$.



Let us show how Proposition 2 applies to this case study. The winning set of states is $F =
\{((q_1, q_2, 0), 1234) \in Q \mid (q_1, q_2, 0) \in Q_p\}$; $\mathrm{Attr}(F)$ is obtained by computing the fixed point
of (2). Due to space limitations, we only give a winning path for the robot according to the
winning strategy $WS_1^*$ with the initial setting of the game in $Q_0$.

Fig. 7: Fragment of the game automaton $\mathcal{G} = (Q, \Sigma_1 \cup \Sigma_2, T, Q_0, F)$ for the door-robot game, where $Q_0 =
\{(q_1, q_2, 1, q_s) \mid q_1 \in I_1, q_2 \in I_2, q_s = q_1 \in \{1, 2, 3, 4\}\}$ and $F = \{(q_1, q_2, 0, 1234) \mid (q_1, q_2, 0) \in Q_p\}$. Note that
upon initialization of a game, the state of $A_1$ (the room occupied by the robot) determines the choice of initial state
in $A_s$ (the room visited by the robot).

If the agent were to have complete knowledge of the game automaton, it could compute the
set of initial states from which it has a winning strategy:

$$Q_0 \cap \mathrm{Attr}(F) = \{(1, ad, 1, 1), (1, ce, 1, 1), (2, ad, 1, 2), (2, bf, 1, 2), (4, ce, 1, 4), (4, bf, 1, 4)\}.$$

Hence, with complete game information, the robot can win the game starting from initial
conditions in $Q_0 \cap \mathrm{Attr}(F)$; note that $\frac{|Q_0 \cap \mathrm{Attr}(F)|}{|Q_0|}$ makes up a mere 25% of all possible initial
configurations. For instance, the agent has no winning strategy if it starts in room 3.*

For the sake of argument, take $q_0 = (1, ad, 1, 1) \in \mathrm{Attr}(F) \cap Q_0$. Since the rank of $q_0$ is
$\rho(q_0) = 7$, following $WS_1^*$ of (3) the robot's fastest winning play is

$$(1, ad, 1, 1) \xrightarrow{4} (4, ad, 0, 14) \xrightarrow{ae} (4, ae, 1, 14) \xrightarrow{2} (2, ae, 0, 124) \xrightarrow{ce}$$
$$(2, ce, 1, 124) \xrightarrow{1} (1, ce, 0, 124) \xrightarrow{ef} (1, ef, 1, 124) \xrightarrow{3} (3, ef, 0, 1234).$$

The adversary's moves, $ae$, $ce$ and $ef$, are selected such that it can slow down the process
of winning of the robot as much as possible; there is no move the environment can make to
prevent the agent from winning since the initial state is in the agent's attractor and the agent
has full knowledge of the game. Note that in the cases where the game rules are described by
the Adjacent and General regimes (see Table I), the robot cannot win no matter which initial
state it is in, because in both cases $\mathrm{Attr}(F) \cap Q_0 = \emptyset$. In these game automata, the agent, even
with perfect knowledge of the behavior of the environment, can never win.

*Although the construction assumes the first move of the robot is to select a room to occupy (because it begins in state 0),
we assume the game begins after the robot has been placed and the closed doors have been selected.

Let us show how a robot that has no prior knowledge of the game rules, but is equipped
with a GIM, can start winning the game once it has observed enough to construct a
correct model of its environment. As the first game starts, the agent realizes that the environment
is not static, but is rather expressed by some (discrete) dynamical system, an SA $A_2$. It assumes
(rightly so in this case) that the language admissible in $A_2$ is strictly 2-local. With this
knowledge, the robot's initial hypothesis of the environment, $A_2^{(0)} = \langle (Q_2, \Sigma_2, T_2), \mathsf{sw}^{(0)} \rangle$, is
formulated in two steps: (i) obtain the SL$_2$-FSA for $(\Sigma_2 \setminus \{\epsilon\})^*$ and assign $\mathsf{sw}^{(0)}(q, \sigma) = 1$, $\forall \sigma \in
\Sigma_2 \setminus \{\epsilon\}$; (ii) add self-loops $T_2(q, \epsilon) = q$ and let $\mathsf{sw}^{(0)}(q, \epsilon) = 1$, $\forall q \in Q_2$.
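
To make the idea of learning a strictly 2-local language from positive data concrete, here is a minimal Python sketch of a string extension learner. It is an illustration under stated assumptions rather than the GIM used in the paper: the class name `SL2Learner`, the boundary markers, the choice to start from an empty sample, and the example door sequences are all hypothetical. The learner stores the set of 2-factors (adjacent symbol pairs) observed so far; each newly observed factor corresponds to a change in the hypothesized automaton.

```python
class SL2Learner:
    """String extension learner for strictly 2-local languages (illustrative)."""

    START, END = "<", ">"  # hypothetical word-boundary markers

    def __init__(self):
        self.factors = set()  # 2-factors observed so far

    def _factors_of(self, word):
        padded = [self.START, *word, self.END]
        return set(zip(padded, padded[1:]))

    def observe(self, word):
        """Add the 2-factors of `word`; return True if the hypothesis changed."""
        new = self._factors_of(word) - self.factors
        self.factors |= new
        return bool(new)

    def accepts(self, word):
        """A word is accepted when every one of its 2-factors has been observed."""
        return self._factors_of(word) <= self.factors


# Each symbol names a door configuration, as in the case study.
learner = SL2Learner()
changed_first = learner.observe(["ad", "ae", "ce"])
changed_again = learner.observe(["ad", "ae", "ce"])
```

Because the hypothesis is just a set that only grows, the learner converges as soon as every 2-factor of the target language has appeared in the observations, which is the role played by the characteristic sample.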

In every round, the agent does the best it can: it takes the action suggested by the strategy
$\mathrm{WS}^*$ constructed from its current theory of mind. Each time it observes a new action on
the part of its adversary, it updates its theory of mind using (4), recomputes $\mathrm{WS}^*$ using (3), and
applies the new strategy in the following round. The agent may realize that it has lost the game
if it finds its current state outside the attractor computed from its most recent theory of mind.
In this case, the agent resigns and starts a new game from a random initial condition, keeping
the model of the environment it has built so far and improving it as it goes. We set an upper
limit on the number of games by restricting the total number of turns played to be less than $n$.

The following simplified algorithm illustrates the procedure. 

1) Let $i = 0$; the game hypothesis is $\mathcal{G}^{(i)}$. The game starts with a random $q_0 \in Q_0$.

2) At the current state $q = (q_1, q_2, 1, q_s)$, if the number of turns exceeds the upper limit $n$, the
sequence of repeated games is terminated. Otherwise, the robot computes $\mathrm{Attr}(F)$ based
on $\mathcal{G}^{(i)}$ (note that it is not necessary to recompute $\mathrm{Attr}(F)$ and $\mathrm{WS}^*(q)$ as long as there is
no update in $\mathcal{G}^{(i)}$ from the previous round). Then, according to $\mathrm{Attr}(F)$ and (6), the robot
either makes a move $\sigma \in \mathrm{WS}^*(q)$ or resigns. If a move is made and $T(q, \sigma) \in F$, the robot
wins. In the case of either winning or resigning the game, the robot restarts the game at
some $q_0 \in Q_0$ with a theory of mind $A_2^{(i)}$ and a hypothesized game automaton $\mathcal{G}^{(i)}$; then
its control goes to Step 2. Otherwise, it goes to Step 3.

3) The adversary takes some action. The robot observes this action and determines whether
to switch on a blocked transition. If a new transition in $A_2^{(i)}$ is observed, it updates $A_2^{(i)}$
to $A_2^{(i+1)}$. Then $\mathcal{G}^{(i)}$ is updated to $\mathcal{G}^{(i+1)}$ according to (5). Otherwise, $A_2^{(i+1)} = A_2^{(i)}$ and
$\mathcal{G}^{(i+1)} = \mathcal{G}^{(i)}$. The robot sets $i = i + 1$ and goes to Step 2.
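
The three steps above can be sketched in Python on an explicit game graph. This is a simplified illustration, not the authors' code: the function names, the graph encoding, the deterministic `pick_start` and `adversary` callbacks, and the toy game are assumptions. One further assumption is that a resignation is counted as a turn, which guarantees that the loop terminates even when the robot resigns immediately.

```python
from itertools import cycle


def attractor(succ, owner, targets):
    """Robot's attractor of `targets` as a {state: rank} map (standard fixed point)."""
    rank = {q: 0 for q in targets}
    level = 0
    while True:
        level += 1
        new = [q for q in succ if q not in rank and (
            any(p in rank for p in succ[q]) if owner[q] == 1
            else bool(succ[q]) and all(p in rank for p in succ[q]))]
        if not new:
            return rank
        rank.update((q, level) for q in new)


def repeated_games(robot_succ, true_adv, hyp_adv, targets, pick_start, adversary, max_turns):
    """Play games until `max_turns` turns have elapsed; return (games, wins, hypothesis)."""
    owner = {**{q: 1 for q in robot_succ}, **{q: 0 for q in true_adv}}
    hyp = {q: set(s) for q, s in hyp_adv.items()}    # theory of mind, grows over time
    rank, games, wins, turns = None, 0, 0, 0
    while turns < max_turns:
        q = pick_start()                             # Step 1, and every restart
        games += 1
        while turns < max_turns:
            if rank is None:                         # Step 2: recompute only after an update
                rank = attractor({**robot_succ, **hyp}, owner, targets)
            turns += 1
            if q not in rank:                        # resign (counted as a turn, see above)
                break
            q = min(robot_succ[q], key=lambda p: rank.get(p, float("inf")))
            if q in targets:
                wins += 1
                break
            nxt = adversary(q, true_adv[q])          # Step 3: observe the adversary
            turns += 1
            if nxt not in hyp[q]:                    # a blocked transition is switched on
                hyp[q].add(nxt)
                rank = None
            q = nxt
    return games, wins, hyp


# Hypothetical toy game: the robot initially believes the adversary at "x" can only
# move to "b"; in truth it can also move to the trap "z".
robot_succ = {"a": {"x"}, "b": {"T"}, "z": {"w"}}
true_adv = {"x": {"b", "z"}, "w": {"z"}, "T": set()}
hyp_adv = {"x": {"b"}, "w": {"z"}, "T": set()}
starts = cycle(["a", "b"])
games, wins, learned = repeated_games(
    robot_succ, true_adv, hyp_adv, {"T"},
    pick_start=lambda: next(starts),
    adversary=lambda q, options: max(options),
    max_turns=12,
)
```

In this toy run the robot loses the first game from "a", learns the missing transition, and from then on resigns at "a" and wins from "b", which mirrors the qualitative behavior described for the learning agent.
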

We can measure the efficiency of the learning algorithm by computing the ratio of
transitions that are switched on during the game sequence to the total number of enabled
transitions in the true game automaton. The convergence of learning is shown in Fig. 8(a); the
results show that after 125 turns, counting both the robot's and the environment's turns (approximately
42 games), the robot's model of the environment converges to the actual one.
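
One way to compute such a ratio is sketched below in Python. The function name and the dictionary encoding of adversary transitions are assumptions, and the sketch counts every true transition that the hypothesis currently enables, including any enabled from the start, which may differ in detail from the measure plotted in Fig. 8(a).

```python
def learning_ratio(hyp_adv, true_adv):
    """Fraction of the true adversary transitions that the hypothesis has switched on."""
    total = sum(len(s) for s in true_adv.values())
    found = sum(len(hyp_adv.get(q, set()) & s) for q, s in true_adv.items())
    return found / total if total else 1.0


# Hypothetical example: one of three true transitions is still unknown.
true_adv = {"x": {"b", "z"}, "w": {"z"}}
before = learning_ratio({"x": {"b"}, "w": {"z"}}, true_adv)
after = learning_ratio({"x": {"b", "z"}, "w": {"z"}}, true_adv)
```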




(a) The convergence of the learning algorithm. The figure shows the ratio of adversary transitions that have been
identified by the agent versus the number of turns the two players have played. In just 125 turns the hybrid
agent has full knowledge of its adversary's dynamics.

(b) Comparison results with three types of robot. For the case of "no learning," the robot eventually
moves out of its attractor and gets trapped.

                  Num of games    Num of wins
No learning       300             0
With learning     300             79
Full knowledge    300             82

Table 8(b) gives outcomes of repeated games in three different scenarios for the robot: (a) Full
Knowledge: the robot knows exactly the model of the environment; (b) No Learning: the robot
has no knowledge of, and no way of identifying, the environment dynamics; and (c) Learning: the
robot starts without prior knowledge of the environment dynamics but utilizes a GIM. The initial
conditions for the game are chosen randomly. In the absence of prior information about the
environment dynamics, and without any process for identifying them, the robot cannot win: in 300
games, it scores no victories. If it had full knowledge of these dynamics, it would have been
able to win 82 out of the 300 times it played the game, a percentage of 27%, which is close
to the theoretical value of 25%. A robot that starts with no prior knowledge but uses its GIM
performs just as well (reaching a win ratio of 26%) as one with full knowledge. In fact, as
Fig. 8(a) suggests, the robot has recovered the performance of an "all-knowing" agent in less
than 15% (42/300) of the number of games played in Table 8(b). We demonstrate
the planning and control of the robot using the KiKS simulation environment in Matlab™.
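
The percentages quoted above follow directly from the counts in Table 8(b) and from the roughly 42 games needed for convergence. The short Python check below reproduces the arithmetic; the variable names are illustrative.

```python
games = 300
wins = {"no learning": 0, "learning": 79, "full knowledge": 82}

# Win ratios in percent, rounded to the nearest integer as in the text.
win_pct = {name: round(100 * w / games) for name, w in wins.items()}

# Share of the 300 games played before the learned model converged (about 42 games).
convergence_share = 42 / games
```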



A simulation video is available at http://research.me.udel.edu/~btanner/Project_figs/newgame.mp4. 
VII. Discussion and Conclusions

This paper shows how the use of grammatical inference in robotic planning and control allows 
an agent to perform a task in an unknown and adversarial environment. Within a game-theoretic 
framework, it is shown that an agent can start from an incomplete model of its environment 
and iteratively update that model via a string extension learner applied to the language of 
its adversary's turns in the game, to ultimately converge on the correct model. Its success is 
guaranteed provided that the language being learned is in the class of languages that can be 
inferred from a positive presentation and the characteristic sample can be observed. This method 
leads to more effective planning, since the agent will win the game if it is possible for it to do 
so. Our primary contribution is thus a demonstration of how grammatical inference and game 
theory can be incorporated in symbolic planning and control of a class of hybrid systems with 
convergent closed loop continuous dynamics. 

The architecture (framework) we propose is universal and can be seen as being composed 
of two distinct blocks: Control synthesis and Learning. The contents of these blocks can vary 
according to the task in consideration and the target model to be learned. The current task is 
a reachability problem, and hence we utilize algorithms for computing a winning strategy in 
reachability games to synthesize symbolic controllers. However, there is nothing inherent in 
the architecture that prevents synthesis of the control using winning strategies of other types of 
games, such as Büchi games [48], [49]. Similarly, because in this paper the rules of the environment are
encoded in a strictly $k$-local grammar, the learning module operates on string extension languages.
However, any language that is identifiable from positive presentation can be considered. The
main difference between our learning module and other machine learning methods, such as
reinforcement learning and Bayesian inference, is that we take advantage of prior knowledge
about the structure of the hypothesis space. This assumption enables the development of faster
and more efficient learning algorithms.